System, method and computer program product for identifying chemical compounds having desired properties

ABSTRACT

An automatic, partially automatic, and/or manual iterative system, method and/or computer program product for generating chemical entities having desired or specified physical, chemical, functional, and/or bioactive properties. The present invention identifies a set of compounds for analysis; collects, acquires or synthesizes the identified compounds; analyzes the compounds to determine one or more physical, chemical and/or bioactive properties (structure-property data); and uses the structure-property data to identify another set of compounds for analysis in the next iteration. An Experiment Planner generates Selection Criteria and/or one or more Objective Functions for use by a Selector. The Selector searches the Compound Library to identify a subset of compounds (a Directed Diversity Library) that maximizes or minimizes the Objective Functions. The compounds listed in the Directed Diversity Library are then collected, acquired or synthesized, and are analyzed to evaluate their properties of interest. In one embodiment, when a compound in a Directed Diversity Library is available in a Chemical Inventory, the compound is retrieved from the Chemical Inventory instead of re-synthesizing the compound. The Analysis Module receives the compounds of the Directed Diversity Library from the Chemical Inventory and/or the Synthesis Module, analyzes the compounds and outputs Structure-Property data. The Structure-Property data is provided to the Experiment Planner and is also stored in the Structure-Property database. The Experiment Planner defines one or more new Selection Criteria and/or Objective Functions for the next iteration of the invention. In one embodiment, a Structure-Property Model Generator generates Structure-Property Models and provides them to the Experiment Planner which uses the Models to generate subsequent Selection Criteria and/or Objective Function.

CROSS REFERENCE TO RELATED APPLICATIONS

[0001] This application is related to commonly owned U.S. provisional patent application No. 60/030,187, filed Nov. 4, 1996.

BACKGROUND OF THE INVENTION

[0002] 1. Field of the Invention

[0003] The present invention relates generally to the generation of chemical entities with defined physical, chemical and/or bioactive properties, and more particularly, to iterative selection and testing of chemical entities.

[0004] 2. Related Art

[0005] Conventionally, new chemical entities with useful properties are generated by identifying a chemical compound (called a “lead compound”) with some desirable property or activity, creating variants of the lead compound, and evaluating the property and activity of those variant compounds. Examples of chemical entities with useful properties include paints, finishes, plasticizers, surfactants, scents, flavorings, and bioactive compounds, but can also include chemical compounds with any other useful property that depends upon chemical structure, composition, or physical state. Chemical entities with desirable biological activities include drugs, herbicides, pesticides, veterinary products, etc. There are a number of flaws with this conventional approach to lead generation, particularly as it pertains to the discovery of bioactive compounds.

[0006] One deficiency pertains to the first step of the conventional approach, i.e., the identification of lead compounds. Traditionally, the search for lead compounds has been limited to an analysis of compound banks, for example, available commercial, custom, or natural products chemical libraries. Consequently, a fundamental limitation of the conventional approach is the dependence upon the availability, size, and structural diversity of these chemical libraries. Although chemical libraries cumulatively total an estimated 9 million identified compounds, they reflect only a small sampling of all possible organic compounds with molecular weights less than 1200. Moreover, only a small subset of these libraries is usually accessible for biological testing. Thus, the conventional approach is limited by the relatively small pool of previously identified chemical compounds which may be screened to identify new lead compounds.

[0007] Also, compounds in a chemical library are traditionally screened (for the purpose of identifying new lead compounds) using a combination of empirical science and chemical intuition. However, as stated by Rudy M. Baum in his article “Combinatorial Approaches Provide Fresh Leads for Medicinal Chemistry,” C&EN, Feb. 7, 1994, pages 20-26, “chemical intuition, at least to date, has not proven to be a particularly good source of lead compounds for the drug discovery process.”

[0008] Another deficiency pertains to the second step of the conventional approach, i.e., the creation of variants of lead compounds. Traditionally, lead compound variants are generated by chemists using conventional chemical synthesis procedures. Such chemical synthesis procedures are manually performed by chemists. Thus, the generation of lead compound variants is very labor intensive and time consuming. For example, it typically takes many chemist years to produce even a small subset of the compound variants for a single lead compound. Baum, in the article referenced above, states that “medicinal chemists, using traditional synthetic techniques, could never synthesize all of the possible analogs of a given, promising lead compound.” Thus, the use of conventional, manual procedures for generating lead compound variants operates to impose a limit on the number of compounds that can be evaluated as new drug leads. Overall, the traditional approach to new lead generation is an inefficient, labor-intensive, time consuming process of limited scope.

[0009] Recently, attention has focused on the use of combinatorial chemical libraries to assist in the generation of new chemical compound leads. A combinatorial chemical library is a collection of diverse chemical compounds generated by either chemical synthesis or biological synthesis by combining a number of chemical “building blocks” such as reagents. For example, a linear combinatorial chemical library such as a polypeptide library is formed by combining a set of chemical building blocks called amino acids in every possible way for a given compound length (i.e., the number of amino acids in a polypeptide compound). Millions of chemical compounds theoretically can be synthesized through such combinatorial mixing of chemical building blocks. For example, one commentator has observed that the systematic, combinatorial mixing of 100 interchangeable chemical building blocks results in the theoretical synthesis of 100 million tetrameric compounds or 10 billion pentameric compounds (Gallop et al., “Applications of Combinatorial Technologies to Drug Discovery, Background and Peptide Combinatorial Libraries,” J. Med. Chem. 37, 1233-1250 (1994)).
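The figures quoted above follow from simple counting: if each of n positions in a linear compound can independently hold any of b interchangeable building blocks, there are b^n possible compounds. The following minimal Python sketch reproduces that arithmetic; the function name `library_size` is an illustrative assumption and does not appear in the cited reference.

```python
def library_size(num_building_blocks: int, compound_length: int) -> int:
    """Count linear compounds of a fixed length when every position
    can independently hold any one of the building blocks."""
    return num_building_blocks ** compound_length


tetramers = library_size(100, 4)  # 100 building blocks, 4 positions
pentamers = library_size(100, 5)  # 100 building blocks, 5 positions
print(f"{tetramers:,} tetramers; {pentamers:,} pentamers")
```

With 100 building blocks this gives 100 million tetramers and 10 billion pentamers, matching the quoted estimate.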

[0010] To date, most work with combinatorial chemical libraries has been limited only to peptides and oligonucleotides for the purpose of identifying bioactive agents; little research has been performed using non-peptide, non-nucleotide based combinatorial chemical libraries. It has been shown that the compounds in peptide and oligonucleotide based combinatorial chemical libraries can be assayed to identify ones having bioactive properties. However, there is no consensus on how such compounds (identified as having desirable bioactive properties and desirable profile for medicinal use) can be used.

[0011] Some commentators speculate that such compounds could be used as orally efficacious drugs. This is unlikely, however, for a number of reasons. First, such compounds would likely lack metabolic stability. Second, such compounds would be very expensive to manufacture, since the chemical building blocks from which they are made most likely constitute high priced reagents. Third, such compounds would tend to have a large molecular weight, such that they would have bioavailability problems (i.e., they could only be taken by injection).

[0012] Others believe that the compounds from a combinatorial chemical library that are identified as having desirable biological properties could be used as lead compounds. Variants of these lead compounds could be generated and evaluated in accordance with the conventional procedure for generating new bioactive compound leads, described above. However, the use of combinatorial chemical libraries in this manner does not solve all of the problems associated with the conventional lead generation procedure. Specifically, the problem associated with manually synthesizing variants of the lead compounds is not resolved.

[0013] In fact, the use of combinatorial chemical libraries to generate lead compounds exacerbates this problem. Greater and greater diversity has often been achieved in combinatorial chemical libraries by using larger and larger compounds (that is, compounds having a greater number of variable subunits, such as pentameric compounds instead of tetrameric compounds in the case of polypeptides). However, it is more difficult, time consuming, and costly to synthesize variants of larger compounds. Furthermore, the real issues of structural and functional group diversity are still not directly addressed; bioactive agents such as drugs and agricultural products possess diversity that could never be achieved with available peptide and oligonucleotide libraries since the available peptide and oligonucleotide components only possess limited functional group diversity and limited topology imposed through the inherent nature of the available components. Thus, the difficulties associated with synthesizing variants of lead compounds are exacerbated by using typical peptide and oligonucleotide combinatorial chemical libraries to produce such lead compounds. The issues described above are not limited to bioactive agents but rather apply to any lead generating paradigm for which a chemical agent of defined and specific activity is desired.

[0014] Additional drawbacks to conventional systems are described in U.S. Pat. No. 5,574,656, titled, “System and Method of Automatically Generating Chemical Compounds with Desired Properties,” issued Nov. 12, 1996, incorporated herein in its entirety by reference.

[0015] Thus, the need remains for a system and method for efficiently and effectively generating new leads designed for specific utilities.

SUMMARY OF THE INVENTION

[0016] The present invention is an automatic, partially automatic, and/or manual iterative system, method and/or computer program product for generating chemical entities having desired or specified physical, chemical, functional, and/or bioactive properties. The present invention is also directed to the chemical entities produced by this system, method and/or computer program product. In an embodiment, the following steps are performed during each iteration:

[0017] (1) identify a set of compounds for analysis;

[0018] (2) collect, acquire or synthesize the identified compounds;

[0019] (3) analyze the compounds to determine one or more physical, chemical and/or bioactive properties (structure-property data); and

[0020] (4) use the structure-property data to identify another set of compounds for analysis in the next iteration.
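The four steps above form a feedback loop, which can be sketched in Python as follows. This is only an illustrative outline under stated assumptions: the names `run_iterations`, `select`, `obtain` and `analyze` are hypothetical placeholders for the selection, collection/synthesis and analysis functions, compounds are represented as plain strings, and each property is reduced to a single number.

```python
from typing import Callable, Dict, List

Compound = str                          # assumption: a compound is just an identifier
PropertyData = Dict[Compound, float]    # assumption: one measured value per compound


def run_iterations(
    select: Callable[[PropertyData], List[Compound]],
    obtain: Callable[[List[Compound]], List[Compound]],
    analyze: Callable[[List[Compound]], PropertyData],
    n_iterations: int,
) -> PropertyData:
    """Repeat steps (1)-(4), accumulating structure-property data."""
    history: PropertyData = {}
    for _ in range(n_iterations):
        candidates = select(history)   # step (1): identify compounds for analysis
        samples = obtain(candidates)   # step (2): collect, acquire or synthesize
        results = analyze(samples)     # step (3): measure properties
        history.update(results)        # step (4): data informs the next selection
    return history


# Toy usage: pick two untested compounds per iteration; "analysis" is a stand-in.
library = ["A", "BB", "CCC", "DDDD"]
data = run_iterations(
    select=lambda hist: [c for c in library if c not in hist][:2],
    obtain=lambda compounds: compounds,
    analyze=lambda compounds: {c: float(len(c)) for c in compounds},
    n_iterations=2,
)
```

Passing the three stages in as functions mirrors the statement that each stage can be automatic, partially automatic or manual.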

[0021] For purposes of illustration, the present invention is described herein with respect to the production of drug leads. However, the present invention is not limited to this embodiment.

[0022] In one embodiment, the system and computer program product includes an Experiment Planner, a Selector, a Synthesis Module and an Analysis Module. The system also includes one or more databases, such as a Structure-Property database, a Compound Database, a Reagent database and a Compound Library.

[0023] The Experiment Planner receives, among other things, Historical Structure-Property data from the Structure-Property database and current Structure-Property data that was generated by the Analysis Module during a prior iteration of the invention.

[0024] The Experiment Planner generates Selection Criteria for use by the Selector. One or more of the Selection Criteria can be combined into one or more Objective Functions. An Objective Function describes the collective ability of a given subset of compounds from the Compound Library to simultaneously satisfy all the prescribed Selection Criteria. An Objective Function defines the influence of each Selection Criterion in the final selection. The Selection Criteria and the exact form of the Objective Function can be specified by a human operator or can be automatically generated by a computer program or other process, or can be specified via human/computer interaction.
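One simple way to let an Objective Function define the influence of each Selection Criterion is a weighted sum, sketched below in Python. The weighted-sum form is an assumption chosen for illustration, since the exact form is left open above; the names `make_objective`, `size_criterion` and `cost_criterion` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# Assumption: a criterion scores a whole subset of compounds with one number.
Criterion = Callable[[Sequence[str]], float]


def make_objective(
    weighted_criteria: List[Tuple[float, Criterion]],
) -> Callable[[Sequence[str]], float]:
    """Combine criteria into one objective; each weight sets a criterion's influence."""
    def objective(subset: Sequence[str]) -> float:
        return sum(weight * criterion(subset) for weight, criterion in weighted_criteria)
    return objective


# Toy criteria (placeholders, not real chemistry): reward subset size, penalize "cost".
def size_criterion(subset: Sequence[str]) -> float:
    return float(len(subset))


def cost_criterion(subset: Sequence[str]) -> float:
    return float(sum(len(name) for name in subset))


objective = make_objective([(2.0, size_criterion), (-1.0, cost_criterion)])
score = objective(["ab", "c"])  # 2.0 * 2 - 1.0 * 3
```

A negative weight turns a criterion into a penalty, so desired and undesired characteristics can share one objective.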

[0025] The one or more Selection Criteria and/or Objective Functions can represent: one or more desired characteristics that the resulting compounds should possess, individually or collectively; one or more undesired characteristics that the resulting compounds should not possess, individually or collectively; and/or one or more constraints that exclude certain compounds and/or combinations of compounds in order to limit the scope of the selection. The Selection Criteria can be in the form of mathematical functions or computer algorithms, and can be calculated using a digital computer.

[0026] The Selector receives the Selection Criteria and Objective Functions and searches the Compound Library to identify a subset of compounds that maximizes or minimizes the Objective Functions. The Compound Library can be a collection of pre-existing or virtual chemical compounds.

[0027] The Selector identifies a smaller subset of these compounds, referred to herein as a Directed Diversity Library, based on one or more Selection Criteria and/or Objective Functions. The number of compounds in this subset can be specified by the operator or can be determined automatically or partially automatically within any limits specified by the operator.

[0028] The Selection Criteria can be applied either simultaneously or sequentially. For example, in one embodiment, one part of the Directed Diversity Library can be selected based on a first set of Selection Criteria and/or Objective Function, while another part of that Directed Diversity Library can be selected based on a second set of Selection Criteria and/or Objective Function.

[0029] The compounds comprising the Directed Diversity Library are then collected, acquired or synthesized, and are analyzed to evaluate their physical, chemical and/or bioactive properties of interest. In one embodiment, when a compound in a Directed Diversity Library is available in a Chemical Inventory, the compound is retrieved from the Chemical Inventory. This avoids unnecessary time and expense of synthesizing a compound that is already available. Compounds that are not available from a Chemical Inventory are synthesized in the Synthesis Module.

[0030] In one embodiment, the Synthesis Module is an automated robotic module that receives synthesis instructions from a Synthesis Protocol Generator. Alternatively, synthesis can be performed manually or semi-automatically.

[0031] The Analysis Module receives the compounds of the Directed Diversity Library from the Chemical Inventory and/or the Synthesis Module. The Analysis Module analyzes the compounds and outputs Structure-Property data. The Structure-Property data is provided to the Experiment Planner and is also stored in the Structure-Property database.

[0032] The Experiment Planner defines one or more new Selection Criteria and/or Objective Functions for the next iteration of the invention. The new Selection Criteria and/or Objective Functions can be defined through operator input, through an automated process, through a partially automated process, or any combination thereof.

[0033] In one embodiment, current and historical Structure-Property data are provided to an optional Structure-Property Model Generator. The Structure-Property data can include structure-property activity data from all previous iterations or from a subset of all previous iterations, as specified by user input, for example.

[0034] The Structure-Property Model Generator generates Structure-Property Models that conform to the observed data. The Structure-Property Models are provided to the Experiment Planner which uses the Models to generate subsequent Selection Criteria and/or Objective Functions. The Selection Criteria and/or Objective Functions are provided to the Selector which selects the next Directed Diversity Library therefrom.

[0035] In one embodiment, the functions of the Experiment Planner, the Selector and the optional Synthesis Protocol Generator are performed by automated machines under the control of one or more computer programs executed on one or more processors and/or human operators. Alternatively, one or more of the functions of the Experiment Planner, the Selector and the optional Synthesis Protocol Generator can be performed manually.

[0036] The functions of the Synthesis Module and the Analysis Module can be performed manually, robotically, or by any combination thereof.

[0037] Further features and advantages of the present invention, as well as the structure and operation of various embodiments of the present invention, are described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Also, the leftmost digit(s) of the reference numbers identify the drawings in which the associated elements are first introduced.

BRIEF DESCRIPTION OF THE FIGURES

[0038] The present invention will be described with reference to the accompanying drawings, wherein:

[0039] FIG. 1 is a flow diagram depicting the flow of data and materials among elements of a lead generation system, in accordance with the present invention;

[0040] FIG. 2 is a flow diagram depicting the flow of data and materials among elements of an embodiment of the lead generation system, in accordance with the present invention;

[0041] FIG. 3 is a block diagram of the lead generation system, in accordance with the present invention;

[0042] FIG. 4 is a block diagram of an analysis module that can be employed by the lead generation system illustrated in FIG. 3;

[0043] FIG. 5 is a block diagram of a structure-property database that can be employed by the lead generation system illustrated in FIG. 3;

[0044] FIG. 6 is a process flowchart illustrating an iterative method for identifying chemical compounds having desired properties;

[0045] FIG. 7 is a process flowchart illustrating a method for performing steps 612 and 614 of the method illustrated in FIG. 6;

[0046] FIG. 8 is a flow diagram depicting the flow of data among elements of a structure-property model generator that can be employed by a lead generation system;

[0047] FIG. 9 is an illustration of a generalized regression neural network model that can be generated by the structure-property model generator illustrated in FIG. 8 and that can employ K-Nearest-Neighbor classifiers;

[0048] FIG. 10 is a flow diagram depicting the flow of data among elements of a fuzzy structure-property model that can be generated by the structure-property model generator illustrated in FIG. 8;

[0049] FIG. 11 is a Neuro-Fuzzy structure-property model that can be generated by the structure-property model generator illustrated in FIG. 8;

[0050] FIG. 12 is a flow diagram depicting the flow of data among an experiment planner and a selector in a lead generation system;

[0051] FIG. 13 is a flow diagram depicting the flow of data during selection of a directed diversity library;

[0052] FIG. 14 illustrates a distribution of compounds in a directed diversity library;

[0053] FIG. 15 illustrates another distribution of compounds in a directed diversity library;

[0054] FIG. 16 illustrates another distribution of compounds in a directed diversity library;

[0055] FIG. 17 is a process flowchart illustrating a method for generating structure-property models in accordance with the present invention;

[0056] FIG. 18 is a process flowchart illustrating a method for selecting a directed diversity library, in accordance with the present invention; and

[0057] FIG. 19 is a block diagram of a computer system that can be used to implement one or more portions of the lead generation system illustrated in FIG. 3.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

Table of Contents

[0058] 1. General Overview

[0059] 2. Example Environment

[0060] 3. Structure-Property Models

[0061] a. Statistical Models

[0062] b. Neural Network Models

[0063] i. Generalized Regression Neural Networks

[0064] c. Fuzzy Logic Models

[0065] d. Hybrid Models

[0066] i. Neuro-fuzzy Models

[0067] e. Model-Specific Methods

[0068] i. Docking Models

[0069] ii. 3D QSAR Models

[0070] 4. Experiment Planner 130

[0071] a. Selection Criteria 104

[0072] i. First Type of Selection Criteria 104

[0073] ii. Second Type of Selection Criteria 104

[0074] b. Objective Functions 105

[0075] 5. Selector 106

[0076] 6. Structure of the Present Invention

[0077] 7. Operation of the Present Invention

[0078] 8. Conclusions

[0079] 1. General Overview

[0080] The present invention is an iterative system, method and computer program product for generating chemical entities having desired physical, chemical and/or bioactive properties. The present invention iteratively selects, analyzes and evaluates Directed Diversity Libraries for desired properties. The present invention can be implemented as a fully or partially automated, computer-aided robotic system, or without any robotics. The present invention is also directed to the chemical entities generated by operation of the present invention.

[0081] Conventional systems perform combinatorial chemical synthesis and analysis of static compound libraries. This tends to be scattershot and random, essentially constituting a “needle in a haystack” research paradigm.

[0082] In contrast, the present invention employs a dynamic Compound Library. The Compound Library is dynamic in that the compounds comprising the Compound Library can change from one iteration of the present invention to the next. The dynamic Compound Library can expand and/or contract.

[0083] The Compound Library includes chemical compounds that already exist and/or chemical compounds that can be synthesized on demand, either individually or combinatorially. The Compound Library can be a combinatorial chemical library, a set of combinatorial chemical libraries and/or non-combinatorial chemical libraries. However, the Compound Library is not limited to a combinatorial chemical library.

[0084] Instead of searching and analyzing the whole Compound Library, the present invention identifies and analyzes particular subsets of the Compound Library. These subsets of the Compound Library are referred to herein as Directed Diversity Libraries. As opposed to conventional techniques, Directed Diversity Libraries provide an optimization approach that is focused and directed.

[0085] 2. Example Environment

[0086] Referring to the flow diagram in FIG. 1, a lead generation/optimization system 100 includes an Experiment Planner 130, a Selector 106, a Synthesis Module 112 and an Analysis Module 118. The system also includes one or more databases, such as: a Structure-Property database 126, a Compound Database 134, a Reagent database 138 and a Compound Library 102.

[0087] The Selector 106 receives Selection Criteria 104 from the Experiment Planner 130. The Selector 106 can also receive one or more Objective Functions 105 from the Experiment Planner 130.

[0088] The Selection Criteria 104 represent desired or undesired characteristics that the resulting compounds should or should not possess, either individually or collectively, and/or constraints that exclude certain compounds and/or combinations of compounds. The Selection Criteria 104 can be in the form of mathematical functions or computer algorithms, and can be calculated using a digital computer.

[0089] One or more of the Selection Criteria 104 can be combined into one or more Objective Functions 105 by the Experiment Planner 130. The Objective Functions 105 describe the extent to which a given set of compounds should satisfy all the prescribed Selection Criteria 104. The Objective Functions 105 can define the influence of each Selection Criterion 104 in the selection of a Directed Diversity Library. The Selection Criteria 104 and the exact form of the Objective Functions 105 can be specified by a human operator or can be automatically or semi-automatically generated (with human input) by the Experiment Planner 130.

[0090] The Selector 106 searches the Compound Library 102 to identify one or more subsets of compounds that maximize or minimize the Selection Criteria 104 and/or Objective Function 105. The subset of compounds is referred to herein as a Directed Diversity Library 108. Note that the Directed Diversity Library 108 is a list of compounds. These compounds may or may not already exist (i.e., they may or may not be in the Chemical Inventory 110). The properties of the Directed Diversity Library 108 of compounds are generally hitherto unknown. The number of compounds in a Directed Diversity Library can be specified by the operator, or can be determined automatically within any limits specified by the operator.
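As a rough illustration of such a search, the Python sketch below builds a subset of a requested size by greedy forward selection, adding at each step the compound that most increases the objective. Greedy selection is an assumption used only to keep the example short; the search strategy is not prescribed here, and the names `greedy_select`, `descriptors` and `spread` are hypothetical. An objective to be minimized can be handled by negating it.

```python
from typing import Callable, Dict, List


def greedy_select(
    library: List[str],
    objective: Callable[[List[str]], float],
    k: int,
) -> List[str]:
    """Pick up to k compounds, each time adding the one that maximizes the objective."""
    selected: List[str] = []
    remaining = list(library)
    for _ in range(min(k, len(remaining))):
        best = max(remaining, key=lambda c: objective(selected + [c]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy objective: sum of pairwise distances over a single made-up descriptor.
descriptors: Dict[str, float] = {"a": 0.0, "b": 1.0, "c": 5.0, "d": 9.0}


def spread(subset: List[str]) -> float:
    return sum(
        abs(descriptors[x] - descriptors[y])
        for i, x in enumerate(subset)
        for y in subset[i + 1:]
    )


directed_library = greedy_select(list(descriptors), spread, k=2)
```

Greedy selection does not guarantee a globally optimal subset; it is shown only because it makes the roles of the library, the objective and the subset size explicit.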

[0091] The Selection Criteria 104 can be applied either simultaneously or sequentially. For example, in one embodiment of the present invention, one part of the Directed Diversity Library 108 can be selected based on a given set of Selection Criteria 104 and/or Objective Function 105, while another part of that Directed Diversity Library 108 can be selected based on a different set of Selection Criteria 104 and/or Objective Function 105. Thus, the present invention represents a multi-objective property refinement system, in the sense that one or more Selection Criteria 104 can be used, and one or more Objective Functions 105 can be pursued, during each iteration.

[0092] Compounds from the Directed Diversity Libraries 108 are provided to the Analysis Module 118 for analysis. Alternatively, the compounds can be manually analyzed or partially manually analyzed and partially automatically analyzed. In one embodiment, one or more compounds in a Directed Diversity Library 108 that have previously been synthesized are retrieved from a Chemical Inventory 110 instead of being synthesized again. This saves time and costs associated with re-synthesizing the selected compounds. The Chemical Inventory 110 represents any source of available compounds including, but not limited to, a corporate chemical inventory, a supplier of commercially available chemical compounds, a natural product collection, etc.

[0093] A system and computer program product that determines whether a compound in a Directed Diversity Library 108 exists in the Chemical Inventory 110 can be implemented within the Selector Module 106, the Synthesis Module 112 or in any other module. For example, the Selector Module 106 can include instructions for searching the Chemical Inventory 110 to identify and retrieve any previously synthesized compounds therefrom that are listed in the Directed Diversity Library 108 (or a subset of the Directed Diversity Library 108, as determined by user input, for example).
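The inventory check described above amounts to partitioning the Directed Diversity Library into compounds that can be retrieved and compounds that must be synthesized. A minimal Python sketch follows; the function name `split_by_availability` and the representation of the inventory as a set of compound identifiers are assumptions made for illustration.

```python
from typing import Iterable, List, Set, Tuple


def split_by_availability(
    directed_library: Iterable[str],
    chemical_inventory: Set[str],
) -> Tuple[List[str], List[str]]:
    """Return (compounds to retrieve from inventory, compounds to synthesize)."""
    to_retrieve: List[str] = []
    to_synthesize: List[str] = []
    for compound in directed_library:
        if compound in chemical_inventory:
            to_retrieve.append(compound)    # already available: no re-synthesis
        else:
            to_synthesize.append(compound)  # not in stock: route to synthesis
    return to_retrieve, to_synthesize


retrieve, synthesize = split_by_availability(
    ["cmpd-1", "cmpd-2", "cmpd-3"],
    chemical_inventory={"cmpd-2", "cmpd-9"},
)
```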

[0094] Compounds in the Directed Diversity Library 108 that are not retrieved from the Chemical Inventory 110 are synthesized individually or combinatorially by the Synthesis Module 112. The Synthesis Module 112 can retrieve and selectively combine Reagents 114 from the Reagent Inventory 116, in accordance with a prescribed chemical synthesis protocol.

[0095] In one embodiment, the Synthesis Module 112 is used to robotically synthesize compounds. As used herein, the term “robotically” refers to any method that involves an automated or partially automated device that performs functions specified by instructions that the Synthesis Module 112 receives from the operator or some other component of the system of the present invention.

[0096] For example, refer to FIG. 2, which is similar to FIG. 1, but which illustrates a Synthesis Protocol Generator 202 in the path to the Synthesis Module 112. The Synthesis Protocol Generator 202 provides Robotic Synthesis Instructions 204 to the Synthesis Module 112. The Synthesis Protocol Generator 202 receives a list of compounds in the Directed Diversity Library 108 to be synthesized. The Synthesis Protocol Generator 202 extracts, under computer control, Reagent Data 136 from a Reagent Database 138, and generates Robotic Synthesis Instructions 204 that will enable the Synthesis Module 112 to automatically or partially automatically synthesize the compounds in the Directed Diversity Library 108.

[0097] The Robotic Synthesis Instructions 204 identify Reagents 114 from a Reagent Inventory 116 that are to be mixed by the Synthesis Module 112. The Robotic Synthesis Instructions 204 also identify the manner in which such Reagents 114 are to be mixed by the Synthesis Module 112. For example, the Robotic Synthesis Instructions 204 can specify which Reagents 114 are to be mixed together. The Robotic Synthesis Instructions 204 can also specify chemical and/or physical conditions, such as temperature, length of time, stirring, etc. for mixing of the specified Reagents 114.
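One possible in-memory representation of such instructions is sketched below in Python. The record layout is purely an assumption for illustration: the class names `MixStep` and `SynthesisInstructions`, the field names, and the units (degrees Celsius, minutes) are hypothetical and are not taken from any actual instruction format.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass(frozen=True)
class MixStep:
    """One mixing operation: which reagents, under which conditions."""
    reagents: Tuple[str, ...]   # identifiers of reagents to mix together
    temperature_c: float        # assumed unit: degrees Celsius
    duration_min: float         # assumed unit: minutes
    stir: bool                  # whether to stir during this step


@dataclass
class SynthesisInstructions:
    """Ordered mixing steps for synthesizing one target compound."""
    target: str
    steps: List[MixStep] = field(default_factory=list)

    def reagents_needed(self) -> Set[str]:
        """All reagents that must be drawn from the reagent inventory."""
        return {reagent for step in self.steps for reagent in step.reagents}


instructions = SynthesisInstructions(
    target="target-1",
    steps=[
        MixStep(reagents=("R1", "R2"), temperature_c=25.0, duration_min=30.0, stir=True),
        MixStep(reagents=("R3",), temperature_c=60.0, duration_min=10.0, stir=False),
    ],
)
needed = instructions.reagents_needed()
```

Keeping the reagents and the mixing conditions in the same record reflects the two roles described above: identifying what is to be mixed and the manner of mixing.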

[0098] In one embodiment, compounds from the Directed Diversity Library 108 are manually synthesized and then delivered to the Analysis Module 118 for analysis.

[0099] In one embodiment, a Compound Library 102 includes a single combinatorial chemical library that can be synthesized from approximately one hundred commercially available reagents that are suitable for generating thrombin inhibitors. Preferably, the Synthesis Module 112 combines these reagents using well-known synthetic chemistry techniques to synthesize inhibitors of the enzyme thrombin. Each inhibitor is generally composed of, but not restricted to, three chemical building blocks. Thus, the Directed Diversity Library 108 preferably comprises a plurality of thrombin inhibitors generally composed of, but not restricted to, three sites of variable structure (i.e., trimers).

[0100] The present invention, however, is not limited to this thrombin example. One skilled in the art will recognize that Compound Library 102 can include many other types of libraries. For example, the present invention is equally adapted and intended to generate other chemical compounds having other desired properties, such as paints, finishes, plasticizers, surfactants, scents, flavorings, bioactive compounds, drugs, herbicides, pesticides, veterinary products, etc., and/or lead compounds for any of the above. In fact, the present invention can generate chemical compounds having any useful properties that depend upon structure, composition, or state.

[0101] As noted above, the compounds in the Directed Diversity Library 108, after being synthesized or retrieved from the Chemical Inventory 110, are provided to the Analysis Module 118 for analysis. Analysis can include chemical, biochemical, physical, and/or biological analysis.

[0102] Preferably, the Analysis Module 118 assays the compounds in the Directed Diversity Library 108 to obtain, for example, enzyme activity data, cellular activity data, toxicology data, and/or bioavailability data. Optionally, the Analysis Module 118 analyzes the compounds to identify which of the compounds were adequately synthesized and which of the compounds were not adequately synthesized. The Analysis Module 118 further analyzes the compounds to obtain other pertinent data, such as structure and electronic structure data.

[0103] The Analysis Module 118 also classifies any compounds that possess the Desired Properties 120 as Leads (lead compounds) 122. Alternatively, this function can be performed by another module such as, for example, the Experiment Planner 130 or the Selector Module 106.
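Classification of Leads can be pictured as a filter over measured properties, as in the following Python sketch. It assumes, for illustration only, that each Desired Property is expressed as an acceptable numeric range and that a compound lacking a measurement for a required property is not classified as a Lead; the name `classify_leads` and the property names are hypothetical.

```python
from typing import Dict, List, Tuple

Measurements = Dict[str, Dict[str, float]]        # compound -> property -> value
DesiredProperties = Dict[str, Tuple[float, float]]  # property -> (min, max), inclusive


def classify_leads(measurements: Measurements, desired: DesiredProperties) -> List[str]:
    """Return compounds whose measured values fall inside every desired range."""
    leads: List[str] = []
    for compound, properties in measurements.items():
        satisfies_all = all(
            name in properties and low <= properties[name] <= high
            for name, (low, high) in desired.items()
        )
        if satisfies_all:
            leads.append(compound)
    return leads


measured = {
    "c1": {"ic50_nm": 5.0, "solubility": 0.8},
    "c2": {"ic50_nm": 500.0, "solubility": 0.9},
    "c3": {"ic50_nm": 2.0},  # solubility not measured
}
wanted = {"ic50_nm": (0.0, 10.0), "solubility": (0.5, 1.0)}
leads = classify_leads(measured, wanted)
```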

[0104] Analysis can be performed automatically, manually or semi-automatically/semi-manually.

[0105] The Analysis Module 118 generates Structure-Property Data 124 for the analyzed compounds. Structure-Property Data 124 can include structure-property and/or structure-activity data. For example, Structure-Property Data 124 can include physical data, synthesis data, enzyme activity data, cellular activity data, toxicology data, bioavailability data, etc. The Structure-Property Data 124 is stored in a Structure-Property Database 126. The Structure-Property Data 124 is also provided to the Experiment Planner 130.

[0106] The Experiment Planner 130 receives current Structure-Property Data 124 from the Analysis Module 118 and Historical Structure-Property Data 128 from the Structure-Property Database 126. Historical Structure-Property Data 128 can include well known structure-property or structure-activity relationship data, collectively referred to as Structure-Property Relationships or SPR, pertaining to one or more relationships between the properties and activities of a compound and the chemical structure of the compound.

[0107] The Experiment Planner 130 also receives Compound Data 132 from the Compound Database 134, Reagent Data 136 from Reagent Database 138 and Desired Properties 120. Desired Properties 120 can be sent from an automated system or database (not shown) or from user input. In one embodiment, the Experiment Planner 130 also receives one or more Structure-Property Models 192 from one or more optional Structure-Property Model Generators 190. The Experiment Planner 130 uses the above inputs to generate one or more Selection Criteria 104 and Objective Functions 105.

[0108] Compound Data 132 and Reagent Data 136 permit the ExperimentPlanner 130 to include, for example, one or more of the followingcriteria in the Selection Criteria 104:

[0109] (1) the molecular diversity of a given set of compounds (as used herein, molecular diversity refers to a collective propensity of a set of compounds to exhibit a variety of a prescribed set of structural, physical, chemical and/or biological characteristics);

[0110] (2) the molecular similarity of a given compound or set of compounds with respect to one or more reference compounds (typically known leads);

[0111] (3) the cost of a given compound or set of compounds if these compounds are to be retrieved from the Chemical Inventory 110, or the cost of the Reagents 114 if the compound(s) are to be synthesized by the Synthesis Module 112;

[0112] (4) the availability of a given compound or set of compounds from the Chemical Inventory 110, or the availability of the Reagents 114 if the compound(s) are to be synthesized by the Synthesis Module 112;

[0113] (5) the predicted ease of synthesis of a given compound or set of compounds if these compound(s) are to be synthesized by the Synthesis Module 112;

[0114] (6) the predicted yield of synthesis of a given compound or set of compounds if these compound(s) are to be synthesized by the Synthesis Module 112;

[0115] (7) the method of synthesis of a given compound or set of compounds if these compound(s) are to be synthesized by the Synthesis Module 112;

[0116] (8) the predicted ability of a given compound or set of compounds to fit a receptor binding site;

[0117] (9) the predicted ability of a given compound or set of compounds to bind selectively to a receptor binding site;

[0118] (10) the predicted ability of a given compound or set of compounds to fit a 3-dimensional receptor map model;

[0119] (11) the predicted bioavailability of a given compound or set of compounds as determined by one or more bioavailability models;

[0120] (12) the predicted toxicity of a given compound or set of compounds as determined by one or more toxicity models; and/or

[0121] (13) other selection criteria that can be derived from information pertaining to a given compound or set of compounds and that can be used to guide the selection of the Directed Diversity Library 108 for the next iteration of the system of the present invention.

[0122] The optional Structure-Property Models 192 can be used by the Experiment Planner 130 to predict the properties of compounds in the Compound Library 102 whose real properties are hitherto unknown. The Structure-Property Models 192 are used by the Experiment Planner 130 to define and/or refine a set of Selection Criteria 104 that depend upon the predictions of one or more Structure-Property Models 192.

[0123] Structure-Property Models 192 permit the Experiment Planner 130 to include one or more of the following in Selection Criteria 104:

[0124] (1) the predicted ability of a given compound or set of compounds to exhibit one or more desired properties as predicted by one or more Structure-Property Models;

[0125] (2) the predicted ability of a given compound or set of compounds to test the validity of one or more Structure-Property Models; and/or

[0126] (3) the predicted ability of a given compound or set of compounds to discriminate between two or more Structure-Property Models (one or more Structure-Property Models can be tested and evaluated in parallel).

[0127] The functionality of the Experiment Planner 130 can be achieved by an automated or partially automated process, or by a trained operator, aided or not by a computer. Further details of Structure-Property Models 192 are provided below.

[0128] The one or more new Selection Criteria 104 and Objective Functions 105 are sent to the Selector 106 which uses them to select a new Directed Diversity Library 108 for the next iteration of the present invention.

[0129] Thus, in summary, the compounds in the new Directed Diversity Library 108 are retrieved from the Chemical Inventory 110 and/or synthesized by the Synthesis Module 112. The Analysis Module 118 analyzes the new Directed Diversity Library 108 to obtain Structure-Property Data 124 pertaining to the compounds in the new Directed Diversity Library 108. The Experiment Planner 130 analyzes the new Structure-Property Data 124, Historical Structure-Property Data 128, and any of Compound Data 132, Reagent Data 136, Desired Properties 120 and Structure-Property Models 192, to identify a new set of Selection Criteria 104. The new set of Selection Criteria 104 can be used by the Selector 106 to select yet another Directed Diversity Library 108 for another iteration.
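By way of illustration only, the iteration summarized above can be sketched as a toy loop. Everything in the sketch is an assumption made for the example: each compound is reduced to a single numeric descriptor, the `assay` callable stands in for the Analysis Module 118, and ranking untested compounds by similarity to the best compound measured so far stands in for the Selection Criteria 104. It is not the claimed implementation.

```python
import random

def run_iterations(library, assay, n_iterations=3, batch_size=4, seed=0):
    """Toy select/analyze/replan loop; returns the accumulated structure-property data."""
    rng = random.Random(seed)
    structure_property_db = {}  # compound descriptor -> measured property
    for _ in range(n_iterations):
        untested = [c for c in library if c not in structure_property_db]
        if not untested:
            break
        if structure_property_db:
            # Planner stand-in (assumption): rank untested compounds by
            # similarity to the best compound measured so far.
            best = max(structure_property_db, key=structure_property_db.get)
            untested.sort(key=lambda c: abs(c - best))
        else:
            # No data yet: the first batch is chosen at random.
            rng.shuffle(untested)
        batch = untested[:batch_size]      # Selector stand-in
        for compound in batch:             # Analysis Module stand-in
            structure_property_db[compound] = assay(compound)
    return structure_property_db

# Hypothetical library of 20 compounds with a hidden optimum at descriptor 13.
library = list(range(20))
data = run_iterations(library, assay=lambda c: -(c - 13) ** 2)
best_compound = max(data, key=data.get)
```

Each pass uses all data gathered in earlier passes, which is the sense in which later selections are guided by previous iterations.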

[0130] Thus, the present invention is an iterative system, method and/or computer program product for generating chemical entities, including new chemical entities, having a set of physical, chemical, and/or biological properties optimized towards a prescribed set of targets. During each iteration, a Directed Diversity Library 108 is generated, the compounds in the Directed Diversity Library 108 are analyzed, Structure-Property Models are optionally derived and elaborated, a list of Selection Criteria 104 is defined, and a new Directed Diversity Library 108 is selected for the next iteration.

[0131] Preferably, elements of the present invention are controlled by a data processing device (with or without operator input, intervention or control), such as a computer operating in accordance with software. Consequently, it is possible in the present invention to store massive amounts of data, and to utilize this data in a current iteration to generate Selection Criteria 104 for the next iteration.

[0132] In particular, since the elements of the present invention are controlled by a data processing device, it is possible to store the Structure-Property Data 124 obtained during each iteration. It is also possible to utilize the Historical Structure-Property Data 128 obtained during previous iterations, as well as other pertinent structure-property data obtained by other experiments, to generate Selection Criteria 104 for the next iteration. In other words, the selection of the Directed Diversity Library 108 for the next iteration is guided by the results of all previous iterations (or any subset of the previous iterations, as determined by user input, for example). Thus, the present invention “learns” from its past performance such that the present invention is “intelligent”. As a result, the Leads 122 identified in subsequent iterations are better (i.e. exhibit physical, chemical, and/or biological properties closer to the prescribed values) than the Leads 122 identified in prior iterations.

[0133] In one embodiment of the present invention, the Compound Library 102 includes one or more combinatorial chemical libraries, comprised exclusively of compounds that can be synthesized by combining a set of chemical building blocks in a variety of combinations. According to this embodiment, the Synthesis Module 112 is used to robotically synthesize the Directed Diversity Library 108 during each iteration.

[0134] The integrated use of data processing devices (i.e. the Experiment Planner 130, the Selector 106, the Synthesis Protocol Generator 202, the Synthesis Module 112, and the Analysis Module 118) in the present invention enables the automatic or semi-automatic and intelligent synthesis and screening of very large numbers of chemical compounds.

[0135] Additional details of the Structure-Property Models 192, Selection Criteria 104, Objective Functions 105, Experiment Planner 130 and the Selector 106 are now provided.

[0136] 3. Structure-Property Models 192

[0137] In one embodiment of the present invention, one or more Structure-Property Model Generators 190 generate Structure-Property Models 192 that conform to observed data. The Structure-Property Models 192 are used by the Experiment Planner 130 to generate Selection Criteria 104 and/or Objective Functions 105.

[0138] Referring to FIG. 8, one embodiment of a Structure-Property Model Generator 190 is illustrated as Structure-Property Model Generator 800. The Structure-Property Model Generator 800 defines a Model Structure 820 based on Statistics 802, Neural Networks 804, Fuzzy Logic 806, and/or other Model-Specific Methods 808.

[0139] Model-Specific Methods 808 refer to methods that are specific to the application domain of the model. Examples of such Model-Specific Methods 808 are methods that compute the energy of a particular molecular conformation or receptor-ligand complex such as an empirical force field or a quantum-mechanical method, methods that align the 3-dimensional structures of two or more chemical compounds based on their shape, electronic fields and/or other criteria, methods that predict the affinity and binding conformation of a ligand to a particular receptor binding site, methods that construct receptor models based on the 3-dimensional structures of known ligands, etc. Examples of such Model-Specific Methods 808 are described in greater detail below.

[0140] The Model Structure 820 can combine elements of Statistics 802, Neural Networks 804, Fuzzy Logic 806, and/or Model-Specific Methods 808. Such Model Structures 820 are hereafter referred to as Hybrid Model Structures or Hybrid Models. An example of such a Hybrid Model Architecture 820 is a Model Architecture that combines elements of Neural Networks 804 and Fuzzy Logic 806, hereafter referred to as a Neuro-Fuzzy Model Architecture or Neuro-Fuzzy Model. An example of a Neuro-Fuzzy Model Architecture is discussed in greater detail below.

[0141] One embodiment of a Structure-Property Model Generator 800 includes a Trainer 822 that generates one or more Structure-Property Models 842 for a given Model Architecture 820. The Trainer 822 optimizes a particular Model Structure 820 using selected Structure-Property Data 124 and 128 from the Structure-Property Database 126, as determined by user input, for example. Preferably, the Trainer 822 optimizes the Model Structure 820 by minimizing the error between the actual properties of selected compounds, as determined by the Analysis Module 118 (Structure-Property Data 124, 128), and the predicted properties of the compounds as determined by the Structure-Property Model 842. The error is referred to hereafter as the Structure-Property Prediction Error or Prediction Error.

[0142] The process of minimizing the Prediction Error shall hereafter be referred to as Training. Preferably, the Trainer 822 minimizes the Prediction Error using a search/optimization method such as Gradient Minimization 832, Monte-Carlo Sampling 834, Simulated Annealing 836, Evolutionary Programming 838, and/or a Genetic Algorithm 840. Alternatively, the Trainer 822 minimizes the Prediction Error using a hybrid search/optimization method that combines elements of Gradient Minimization 832, Monte-Carlo Sampling 834, Simulated Annealing 836, Evolutionary Programming 838, and/or a Genetic Algorithm 840. An example of a hybrid method is a method that combines Simulated Annealing 836 with Gradient Minimization 832. Another example of a hybrid method is a method that combines Monte-Carlo Sampling 834 with Gradient Minimization 832. Examples of such methods are described in greater detail below.
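As a minimal sketch of Training by gradient minimization, the following example fits a one-descriptor linear model by repeatedly stepping against the gradient of the mean squared Prediction Error. The model form, the learning rate, the step count and the function name `train_gradient_descent` are all assumptions chosen for illustration; the specification does not prescribe them.

```python
def train_gradient_descent(xs, ys, learning_rate=0.05, n_steps=2000):
    """Fit y ~ w * x + b by minimizing the mean squared prediction error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_steps):
        # Gradient of mean((w*x + b - y)^2) with respect to w and b.
        grad_w = sum(2.0 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2.0 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

def prediction_error(w, b, xs, ys):
    """Mean squared difference between actual and predicted properties."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data: one descriptor per compound, property = 2 * x + 1.
descriptor_values = [0.0, 1.0, 2.0, 3.0, 4.0]
measured_properties = [2.0 * x + 1.0 for x in descriptor_values]
w, b = train_gradient_descent(descriptor_values, measured_properties)
```

A hybrid method of the kind mentioned above could, for example, use a stochastic search to choose starting points and a gradient routine such as this one to refine each of them.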

[0143] Preferably, the Structure-Property Data 124, 128 are divided into Structure Data 824 and Property Data 828. Structure Data 824 and Property Data 828 are preferably encoded as Encoded Structure Data 826 and Encoded Property Data 830. Encoding should be of a form that is appropriate for the particular Model Structure 820. The Encoded Structure Data 826 and Encoded Property Data 830 are used by the Trainer 822 to derive one or more final Structure-Property Models 842. The Trainer 822 can employ Gradient Minimization 832, Monte-Carlo Sampling 834, Simulated Annealing 836, Evolutionary Programming 838, and/or a Genetic Algorithm 840. The Trainer 822 trains the Model Structure 820 using a suitably encoded version of the Structure-Property Data 124, 128, or a selected subset of the Structure-Property Data 124, 128, as determined by user-input, for example.

[0144] The Trainer 822 generates one or more Structure-Property Models 842 for a given Model Structure 820. In one embodiment, Structure-Property Models 842 are represented as a linear combination of basis functions of one or more molecular features (descriptors). The descriptors collectively represent the Encoded Structure Data 826.

[0145] To illustrate the present invention, several example embodiments and implementations of the Structure-Property Model Generator 800 shall now be discussed in detail. These examples are provided to illustrate the present invention. The present invention is not limited to these examples.

[0146] a. Statistical Models

[0147] A Statistical Module 802 can define a Statistical Model Structure 820. When the Trainer 822 optimizes the Statistical Model Structure 820, the resultant Structure-Property Model 842 is referred to as a Statistical Structure-Property Model 842.

[0148] In one embodiment, Structure-Property Models 192 are represented as a linear combination of basis functions of one or more molecular features (descriptors). The descriptors can include topological indices, physicochemical properties, electrostatic field parameters, volume and surface parameters, etc. The number of descriptors can range from a few tens to tens of thousands. For example, the descriptors can include, but are not limited to, molecular volume and surface areas, dipole moments, octanol-water partition coefficients, molar refractivities, heats of formation, total energies, ionization potentials, molecular connectivity indices, substructure keys, hashed fingerprints, atom pairs and/or topological torsions, atom layers, 2D and 3D auto-correlation vectors, 3D structural and/or pharmacophoric keys, electronic fields, etc.

[0149] Such descriptors and their use in the fields of Quantitative Structure-Activity Relationships (QSAR) and molecular diversity are reviewed in Kier, L. B. and Hall L. H., Molecular Connectivity in Chemistry and Drug Research, Academic Press, New York (1976); Kier, L. B. and Hall L. H., Molecular Connectivity in Structure-Activity Analysis, Research Studies Press, Wiley, Letchworth (1986); Kubinyi, H., Methods and Principles in Medicinal Chemistry, Vol. 1, VCH, Weinheim (1993); and Agrafiotis, D. K., Encyclopedia of Computational Chemistry, Wiley (in press), the contents of which are incorporated herein by reference.

[0150] In one embodiment, the coefficients of the linear combination of the basis functions of Statistical Structure-Property Models 842 are determined using linear regression techniques. If many features are used, linear regression can be combined with principal component analysis, factor analysis, and/or multi-dimensional scaling. These are well known techniques for reducing the dimensionality and extracting the most important features from a large table.
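For illustration, the coefficients of a linear combination of basis functions can be obtained by ordinary least squares. The sketch below forms the normal equations and solves them by Gaussian elimination, one of several ways to perform linear regression. The descriptor layout, the basis functions and all function names are hypothetical and are not taken from the specification.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_basis_model(descriptors, properties, basis_functions):
    """Least-squares coefficients for property ~ sum_k c_k * f_k(descriptor)."""
    phi = [[f(d) for f in basis_functions] for d in descriptors]  # design matrix
    k = len(basis_functions)
    ata = [[sum(row[i] * row[j] for row in phi) for j in range(k)] for i in range(k)]
    atb = [sum(row[i] * y for row, y in zip(phi, properties)) for i in range(k)]
    return solve_linear(ata, atb)

def predict(coefficients, basis_functions, descriptor):
    return sum(c * f(descriptor) for c, f in zip(coefficients, basis_functions))

# Hypothetical two-feature descriptors and basis functions (constant, linear, quadratic).
basis = [lambda d: 1.0, lambda d: d[0], lambda d: d[0] ** 2, lambda d: d[1]]
descriptors = [(i * 0.5, (i * 7 % 5) * 0.3) for i in range(12)]
properties = [1.0 + 2.0 * d[0] - 0.5 * d[0] ** 2 + 3.0 * d[1] for d in descriptors]
coefficients = fit_basis_model(descriptors, properties, basis)
```

With many correlated descriptors the normal equations become ill-conditioned, which is one practical reason to reduce dimensionality first, as the paragraph above suggests.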

[0151] In one embodiment, the basis functions and/or features used by the Trainer 822 to optimize the Statistical Structure-Property Models 842 are selected using Monte-Carlo Sampling 834, Simulated Annealing 836, Evolutionary Programming 838, and/or a Genetic Algorithm 840. A method for selecting the basis functions and/or features using a Genetic Algorithm 840, known as a genetic function approximation (GFA), is described in Rogers and Hopfinger, J. Chem. Inf. Comput. Sci., 34: 854 (1994) incorporated herein by reference in its entirety.

[0152] In the GFA algorithm, a Structure-Property Model 842 is represented as a linear string that encodes the features and basis functions employed by the model. A population of linearly encoded Structure-Property Models 842 is then initialized by a random process, and allowed to evolve through the repeated application of genetic operators, such as crossover, mutation and selection. Selection is based on the relative fitness of the models, as measured by a least-squares error procedure, for example. Friedman's lack-of-fit algorithm, described in J. Friedman, Technical Report No. 100, Laboratory for Computational Statistics, Department of Statistics, Stanford University, Stanford, Calif., November 1988, herein incorporated by reference in its entirety, or other suitable metrics well known to persons skilled in the art, can also be used. GFA can build models using linear polynomials as well as higher-order polynomials, splines and Gaussians. Upon completion, the procedure yields a population of models, ranked according to their fitness score.

[0153] Another method for selecting basis functions and/or features is described in Luke, J. Chem. Inf. Comput. Sci., 34: 1279 (1994), incorporated herein by reference in its entirety. This method is similar to the GFA method of Rogers and Hopfinger described above, but uses Evolutionary Programming 838 instead of a Genetic Algorithm 840 to control the evolution of the population of models.

[0154] Alternatively, the basis functions and/or features can be selected using a Monte-Carlo Sampling 834 or Simulated Annealing 836 technique. In this embodiment, an initial model is generated at random, and is gradually refined by a series of small stochastic ‘steps’. Here, the term ‘step’ is taken to imply a stochastic (random or semi-random) modification of the model's underlying structure.

[0155] As in the GFA algorithm, the model in this embodiment is also defined as a linear combination of basis functions, whose coefficients are determined by linear regression. During each step, the model is modified by making a ‘small’ stochastic step. For example, the model can be modified by inserting a new basis function, by removing an existing basis function, by modifying an existing basis function (i.e. by modifying one or more of the features and/or parameters associated with that particular basis function), and/or by swapping features and/or parameters between two (compatible) basis functions.

[0156] The quality of the model is assessed using a least-squares error criterion. Alternatively, Friedman's lack-of-fit criterion, or any other suitable error criterion can be used. At the end of each step, the new model is compared to the old model using the Metropolis criterion. Alternatively, any other suitable comparison criterion can be used. If the new model is approved, it replaces the old model and the process is repeated. If the new model is not approved, the old model is retained as the current model, and the process is repeated. This general process is controlled by a Monte-Carlo Sampling protocol 834, a Simulated Annealing protocol 836, or variants thereof, which are well known to persons skilled in the art.
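The stepwise refinement described above can be sketched as follows, under simplifying assumptions. A model is reduced to a set of feature indices, a step inserts or removes one feature, and the caller supplies the error function; in practice that function would be, for example, the least-squares error of the regression model built from the selected features. The geometric cooling schedule and all names and parameter values are illustrative choices, not taken from the specification.

```python
import math
import random

def anneal_features(n_features, error, n_steps=2000, t_start=1.0, cooling=0.995, seed=0):
    """Simulated annealing over feature subsets using the Metropolis criterion."""
    rng = random.Random(seed)
    # Initial model generated at random.
    current = frozenset(f for f in range(n_features) if rng.random() < 0.5)
    current_err = error(current)
    best, best_err = current, current_err
    temperature = t_start
    for _ in range(n_steps):
        # One small stochastic step: insert or remove a single feature.
        candidate = current ^ {rng.randrange(n_features)}
        candidate_err = error(candidate)
        delta = candidate_err - current_err
        # Metropolis criterion: always accept an improvement; accept a worse
        # model with probability exp(-delta / temperature).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_err = candidate, candidate_err
            if current_err < best_err:
                best, best_err = current, current_err
        temperature *= cooling  # geometric cooling schedule (assumption)
    return best, best_err

# Toy error (assumption): number of features that differ from a hidden ideal subset.
ideal = frozenset({0, 2, 5})
best_model, best_error = anneal_features(6, lambda model: len(model ^ ideal))
```

Holding the temperature fixed instead of cooling it turns the same loop into plain Monte-Carlo sampling.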

[0157] During the training process, the Trainer 822 can be configured to retain a list of models according to some predefined criteria. For example, the Trainer 822 can be configured to retain the ten best Structure-Property Models 842 discovered during the simulation. Alternatively, the Trainer 822 can be configured to retain the ten best Structure-Property Models 842 discovered during the simulation, which differ from each other by some predetermined amount. The difference between two models can be defined ‘genotypically’ or ‘phenotypically’. A ‘genotypical’ comparison between two models involves a comparison of their underlying structure (i.e. the basis functions and/or coefficients used to represent the Structure-Property Models 842). Conversely, a ‘phenotypical’ comparison between two models involves a comparison based on their respective predictions.
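One possible reading of the two comparison modes is sketched below. The specific measures are assumptions made for the example: the Jaccard distance between feature sets serves as the ‘genotypical’ difference, and the root-mean-square difference between predictions serves as the ‘phenotypical’ difference. The helper `retain_best_distinct` is likewise hypothetical.

```python
import math

def genotypic_distance(features_a, features_b):
    """Jaccard distance between the feature sets of two models (assumed measure)."""
    union = features_a | features_b
    if not union:
        return 0.0
    return 1.0 - len(features_a & features_b) / len(union)

def phenotypic_distance(predictions_a, predictions_b):
    """Root-mean-square difference between two models' predictions (assumed measure)."""
    n = len(predictions_a)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(predictions_a, predictions_b)) / n)

def retain_best_distinct(models, k, min_distance, distance):
    """Keep up to k lowest-error models that differ pairwise by at least min_distance.

    models is a list of (error, representation) pairs.
    """
    kept = []
    for err, rep in sorted(models, key=lambda m: m[0]):
        if all(distance(rep, other) >= min_distance for _, other in kept):
            kept.append((err, rep))
        if len(kept) == k:
            break
    return kept

# Hypothetical candidate models: (prediction error, set of feature indices).
candidates = [
    (0.10, frozenset({1, 2})),
    (0.11, frozenset({1, 2})),   # structurally identical to the best model
    (0.20, frozenset({3, 4})),
    (0.30, frozenset({1, 5})),
]
retained = retain_best_distinct(candidates, k=2, min_distance=0.5,
                                distance=genotypic_distance)
```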

[0158] b. Neural Network Models

[0159] The Structure-Property Model Generator 800 can generate Structure-Property Models 842 based on Neural Networks 804. Neural Networks 804 are physical cellular systems that can acquire, store, and utilize experimental knowledge. Neural Networks 804 are extensively reviewed in Haykin, Neural Networks. A Comprehensive Foundation, MacMillan, New York (1994), incorporated herein by reference in its entirety.

[0160] As in the functional models described above, Structure Data 824 can be encoded using one or more molecular features (descriptors). Molecular features collectively represent the Encoded Structure Data 826. Molecular features can include topological indices, physicochemical properties, electrostatic field parameters, volume and surface parameters, etc., and their number can range from a few tens to tens of thousands. For example, these features can include, but are not limited to, molecular volume and surface areas, dipole moments, octanol-water partition coefficients, molar refractivities, heats of formation, total energies, ionization potentials, molecular connectivity indices, substructure keys, hashed fingerprints, atom pairs and/or topological torsions, atom layers, 2D and 3D auto-correlation vectors, 3D structural and/or pharmacophoric keys, electronic fields, etc. If many features are used, neural network training can be combined with principal component analysis, factor analysis, and/or multi-dimensional scaling, which are well known techniques for reducing the dimensionality and extracting the most important features from a large table.

[0161] One embodiment of a Neural Network Model Structure 820 is a Multi-Layer Feed-Forward Neural Network or Multi-Layer Perceptron, trained using the error back-propagation algorithm. Alternatively, the Multi-Layered Perceptron can be trained using Monte-Carlo Sampling 834, Simulated Annealing 836, Evolutionary Programming 838, and/or a Genetic Algorithm 840. In general, Neural Network training is the process of adjusting the number of neurons, synaptic weights, and/or transfer functions in the input, output and hidden layers of the Neural Network, so that the overall prediction error is minimized. Many variants of such training algorithms have been reported, and are well known to those skilled in the art.

[0162] As in the functional models described above, the Trainer 822 can be configured to retain more than one Neural Network Model 842 during the training phase (flow arrow 890 in FIG. 8). For example, the Trainer 822 can be configured to retain the ten best Neural Network Models 842 discovered during the training phase. Alternatively, the Trainer 822 can be configured to retain the ten best Neural Network Models 842 discovered during training, which differ from each other by some predetermined amount. Again, the difference between two models can be defined ‘genotypically’ or ‘phenotypically’, i.e. by comparing the models based either on their internal structure, or their predictions.

[0163] i. Generalized Regression Neural Networks

[0164] Another embodiment of a Neural Network Model Structure 820 is a Generalized Regression Neural Network Model Structure (or Generalized Regression Neural Network). Generalized Regression Neural Networks are described in Specht, D. IEEE Trans. Neural Networks, 2(6): 568 (1991), and Masters, T., Advanced Algorithms for Neural Networks, Wiley (1995), incorporated herein by reference.

[0165] An example of a Generalized Regression Neural Network 900 is shown in FIG. 9. A Generalized Regression Neural Network 900 is comprised of four layers of neurons (units). The first layer is the Input Layer 902, the second layer is the Pattern Layer 904, the third layer is the Summation Layer 906, and the fourth layer is the Output Layer 908, which is comprised of a single unit.

[0166] The Pattern Layer 904 contains one unit per input-output pair or structure-property pair (referred to hereafter as a Training Case). The collection of all Training Cases used in the Pattern Layer 904 is hereafter referred to as the Training Set. In the example shown in FIG. 9, there are four Training Cases. The input vector (or input case, which in the example shown in FIG. 9 consists of 3 variables) is simultaneously presented to all units in the Pattern Layer 904. Each of these units computes a distance measure separating the Training Case represented by that unit from the input case. This distance is acted on by the transfer function associated with that unit, to compute the output of that particular unit. The transfer function is also referred to as an activation function or kernel.

[0167] The Summation Layer 906 of the Generalized Regression Neural Network 900 (i.e. the third layer) is comprised of two units. The first unit is called the Numerator 910, and the second unit is called the Denominator 912. Each unit in the Pattern Layer 904 is fully connected to the Numerator 910 and Denominator 912 units in the Summation Layer 906. Both the Numerator 910 and Denominator 912 units are simple summation units, i.e. they accumulate the input received from all units in the Pattern Layer 904. For the Denominator 912 unit, the weight vector is unity, so a simple sum is performed. For the Numerator 910 unit, the weight connecting each pattern unit is equal to the value of the dependent variable for the training case of that pattern unit (i.e. the output in the input-output pair, or the property in the structure-property pair).

[0168] The outputs of the Numerator 910 and Denominator 912 units in the Summation Layer 906 are forwarded to the Output unit 908. The Output unit 908 divides the output of the Numerator 910 unit by the output of the Denominator 912 unit, to compute the output of the network for a particular input case.
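The four-layer computation described above reduces to a kernel-weighted average, which can be sketched in a few lines. The Gaussian activation, the single shared width `sigma` and the example data are assumptions made for the illustration; the specification allows other activation functions and per-feature scaling.

```python
import math

def grnn_predict(training_cases, query, sigma=1.0):
    """Predict a property for query from (input_vector, property) Training Cases."""
    numerator = 0.0    # Numerator unit: activations weighted by known properties
    denominator = 0.0  # Denominator unit: plain sum of activations
    for inputs, known_property in training_cases:
        # Pattern unit: squared Euclidean distance, then Gaussian activation.
        d2 = sum((a - b) ** 2 for a, b in zip(inputs, query))
        activation = math.exp(-d2 / (2.0 * sigma ** 2))
        numerator += known_property * activation
        denominator += activation
    # Output unit: ratio of the two summation units.
    return numerator / denominator

# Four hypothetical Training Cases with three input variables each, as in FIG. 9.
training_set = [
    ((0.0, 0.0, 0.0), 1.0),
    ((1.0, 0.0, 0.0), 2.0),
    ((0.0, 1.0, 0.0), 3.0),
    ((0.0, 0.0, 1.0), 4.0),
]
at_case = grnn_predict(training_set, (1.0, 0.0, 0.0), sigma=0.1)
blended = grnn_predict(training_set, (0.3, 0.3, 0.3), sigma=1.0)
```

With a narrow kernel the prediction at a Training Case is dominated by that case; with a wide kernel it approaches the mean of all known properties. A query far from every Training Case can make the denominator underflow to zero, so a production version would need to guard the division.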

[0169] The activation used by the units in the Pattern Layer 904 is typically a Parzen Window. The Parzen Window technique is a well known method for estimating a univariate or multivariate probability density function from a random sample. It is described in Parzen, Annals Math. Stat., 33: 1065 (1962), and Cacoullos, Annals Inst. Stat. Math., 18(2): 179 (1966), incorporated herein by reference in their entirety. The Parzen Window is a weight function w(d) that has its largest value at d=0, and decreases rapidly as the absolute value of d increases. Examples of such weight functions are histogram bins, Gaussians, triangular functions, reciprocal functions, etc. If the number of input variables (features) exceeds one, the Parzen Window can involve different scaling parameters for each input variable. Thus, a Parzen Window can be configured to perform feature scaling in the vicinity of the Training Case on which it is centered. If the Parzen Windows associated with each Training Case share common feature weights, the Generalized Regression Neural Network 900 is said to be globally weighted. Conversely, if the Parzen Windows associated with each Training Case do not share common feature weights, the Generalized Regression Neural Network 900 is said to be locally weighted.
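The weight functions named above can be written out as follows. The exact functional forms, the `width` parameter and the weighted distance used for feature scaling are illustrative assumptions; each function simply peaks at d=0 and does not increase as the absolute value of d grows.

```python
import math

def histogram_bin(d, width=1.0):
    """Constant weight inside the bin, zero outside."""
    return 1.0 if abs(d) <= width else 0.0

def gaussian(d, width=1.0):
    return math.exp(-(d / width) ** 2)

def triangular(d, width=1.0):
    return max(0.0, 1.0 - abs(d) / width)

def reciprocal(d, width=1.0):
    return 1.0 / (1.0 + abs(d) / width)

def scaled_distance(inputs, center, feature_weights):
    """Distance with one scaling parameter per input variable (feature scaling)."""
    return math.sqrt(sum(w * (a - b) ** 2
                         for w, a, b in zip(feature_weights, inputs, center)))

windows = [histogram_bin, gaussian, triangular, reciprocal]
# A weight of zero removes a feature from the distance entirely.
d = scaled_distance((1.0, 2.0), (0.0, 0.0), feature_weights=(1.0, 0.0))
```

In a globally weighted network every Training Case would share one `feature_weights` tuple; in a locally weighted network each Training Case would carry its own.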

[0170] Referring back to FIG. 8, a Generalized Regression Neural Network 900 can be trained to minimize the prediction error using Gradient Minimization 832, Monte-Carlo Sampling 834, Simulated Annealing 836, Evolutionary Programming 838, and/or a Genetic Algorithm 840. Alternatively, the Generalized Regression Neural Network 900 can be trained to minimize the prediction error using a combination of Gradient Minimization 832, Monte-Carlo Sampling 834, Simulated Annealing 836, Evolutionary Programming 838, and/or a Genetic Algorithm 840.

[0171] The training process involves adjusting the parameters of the activation function associated with each unit in the Pattern Layer 904 to minimize the mean prediction error for the entire Training Set, or some other suitable error criterion. During training, the input-output pairs in the Training Set are presented to the network, and a prediction error for the entire Training Set is computed. In particular, each Training Case is presented to each of the units (Training Cases) in the Pattern Layer 904, and the outputs of these units are summed by the units in the Summation Layer 906. The outputs of the summation units 910 and 912 are then divided to compute the output of the network for that particular Training Case.

[0172] This process is repeated for each Training Case in the Training Set. The parameters of the transfer functions are then adjusted so that the prediction error is reduced. This process is repeated until the prediction error for the entire Training Set is minimized, within some prescribed tolerance. Alternatively, the process is repeated for a prescribed number of cycles (as determined by user input, for example), even though the prediction error for the entire Training Set may not be at a minimum, within a prescribed tolerance. Preferably, during the training phase, each Training Case is not presented to itself, i.e. the output of each Training Case is computed based on every Training Case other than itself. Thus, it is said that the resulting Generalized Regression Neural Network Models 842 are cross-validated, in the sense that they were designed to resist overfitting.
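A minimal sketch of this leave-one-out training error follows, assuming a Gaussian activation with a single shared width. For brevity the width is chosen by scanning a short list of candidate values, which is a plain grid search substituted here for the optimization methods named in the text; the function names and data are hypothetical.

```python
import math

def grnn_predict(training_cases, query, sigma):
    """Kernel-weighted average of the known properties (Gaussian activation assumed)."""
    numerator = denominator = 0.0
    for inputs, known_property in training_cases:
        d2 = sum((a - b) ** 2 for a, b in zip(inputs, query))
        activation = math.exp(-d2 / (2.0 * sigma ** 2))
        numerator += known_property * activation
        denominator += activation
    return numerator / denominator

def leave_one_out_error(training_cases, sigma):
    """Mean squared error when each Training Case is predicted from all the others."""
    total = 0.0
    for i, (inputs, known_property) in enumerate(training_cases):
        others = training_cases[:i] + training_cases[i + 1:]  # case not presented to itself
        total += (grnn_predict(others, inputs, sigma) - known_property) ** 2
    return total / len(training_cases)

def choose_sigma(training_cases, candidate_sigmas):
    """Grid search (an assumption) for the width with the lowest leave-one-out error."""
    return min(candidate_sigmas, key=lambda s: leave_one_out_error(training_cases, s))

# Hypothetical one-variable Training Set with property equal to the input.
training_set = [((float(i),), float(i)) for i in range(10)]
candidate_sigmas = [0.25, 0.5, 1.0, 2.0, 4.0]
best_sigma = choose_sigma(training_set, candidate_sigmas)
```

Because a case never contributes to its own prediction, shrinking the width toward zero cannot drive the training error to zero by memorization, which is the resistance to overfitting noted above.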

[0173] If the number of features is large, the Trainer 822 can also perform feature selection in addition to scaling (i.e. adjusting the parameters of the transfer functions). Feature selection refers to the process of selecting a subset of features, and applying the Generalized Regression Neural Network 900 algorithm only on that subset of features.

[0174] For example, in one embodiment, the Generalized Regression Neural Network 900 is trained using a Monte-Carlo Sampling 834 or Simulated Annealing 836 algorithm. In this embodiment, an initial model is generated at random, by selecting a random set of features and randomizing the transfer functions associated with each Training Case.

[0175] The model is then gradually refined by a series of small stochastic ‘steps’. Here, the term ‘step’ is taken to imply a stochastic (random or semi-random) modification of the model's underlying structure. For example, the model can be modified by inserting a new feature, by removing an existing feature, by modifying an existing feature weight if the model is globally weighted, and/or by modifying a randomly chosen transfer function (i.e. by modifying one or more of the parameters associated with that particular transfer function, such as a feature weight). After the ‘step’ is performed, the quality of the resulting model is assessed, and the new model is compared to the old model using the Metropolis criterion. Alternatively, any other suitable comparison criterion can be used. If the new model is approved, it replaces the old model and the process is repeated. If the new model is not approved, the old model is retained as the current model, and the process is repeated.

[0176] This general process is controlled by a Monte-Carlo Sampling protocol 834, a Simulated Annealing protocol 836, or variants thereof, which are well known to people skilled in the art. However, it should be understood that the system of the present invention is not limited to these embodiments. Alternatively, the Generalized Regression Neural Network 900 can be trained using Evolutionary Programming 838, Genetic Algorithms 840, or any other suitable search/optimization algorithm. The implementation of these methods should be straightforward to persons skilled in the art.

[0177] The training of a Generalized Regression Neural Network 900 using the method described above involves N*(N−1) distance comparisons during each optimization cycle, where N is the number of Training Cases. That is, in order to compute the prediction error for the entire Training Set, each of the N Training Cases must be presented to all other (N−1) Training Cases in the network. Thus, it is said that the system operating in the manner described above exhibits quadratic time complexity.

[0178] For large Training Sets, such as those anticipated in a typical operation of the system of the present invention, this process can become computationally intractable. To remedy this problem, a preferred embodiment of the system of the present invention uses a hybrid approach that combines Generalized Regression Neural Networks 900 with K-Nearest-Neighbor classifiers.

[0179] K-Nearest-Neighbor prediction is a well known technique for property prediction and classification. It is described in detail in Dasarathy, Nearest Neighbor (NN) Norms: NN pattern classification techniques, IEEE Computer Society Press, Los Alamitos, Calif. (1991), incorporated herein by reference in its entirety. K-Nearest-Neighbor prediction forms the basis of many ‘lazy learning’ algorithms that are commonly used in artificial intelligence and control. The K-Nearest-Neighbor algorithm predicts the output (property) of a particular input query by retrieving the K nearest (most similar) Training Cases to that query, and averaging their (known) outputs according to some weighting scheme. Therefore, the quality of K-Nearest-Neighbor generalization depends on which Training Cases are considered most similar, which is, in turn, determined by the distance function.

[0180] In the embodiment described herein, Generalized Regression Neural Networks 900 are combined with K-Nearest-Neighbor classifiers, to generate a hybrid Model Structure 820 referred to hereafter as a Nearest Neighbor Generalized Regression Neural Network. The operation of a Nearest Neighbor Generalized Regression Neural Network is similar to that of a regular Generalized Regression Neural Network, except that the query (input case) is not presented to all Training Cases in the Pattern Layer 904. Instead, the query is presented to the K nearest Training Cases in the Pattern Layer 904, as determined by a suitable distance metric.
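A minimal sketch of such a nearest-neighbor prediction is given below. It assumes a Euclidean distance, a Gaussian kernel with a single width `sigma`, and a brute-force neighbor search; the function name `knn_grnn_predict` is hypothetical. A spatial index would replace the brute-force search in practice.

```python
import heapq
import math

def knn_grnn_predict(query, cases, k, sigma=1.0):
    """Predict a property from the k Training Cases nearest to the query.

    cases is a list of (features, property) pairs. Only the k nearest
    cases contribute to the kernel-weighted average.
    """
    scored = ((math.dist(query, x), y) for x, y in cases)
    nearest = heapq.nsmallest(k, scored, key=lambda item: item[0])
    num = den = 0.0
    for d, y in nearest:
        w = math.exp(-d * d / (2.0 * sigma * sigma))
        num += w * y
        den += w
    if den > 0:
        return num / den
    # All kernel weights underflowed: fall back to an unweighted mean.
    return sum(y for _, y in nearest) / len(nearest)
```

Because distant Training Cases are excluded before the kernel is evaluated, an outlying case with an extreme property value does not influence the prediction.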

[0181] To accelerate the performance of a Nearest Neighbor Generalized Regression Neural Network, the K nearest neighbors are retrieved using a nearest neighbor detection algorithm such as a k-d tree (Bentley, Comm. ACM, 18(9): 509 (1975), Friedman et al., ACM Trans. Math. Soft., 3(3): 209 (1977)). Alternatively, any other suitable algorithm can be used including, but not limited to, ball trees (Omohundro, International Computer Science Institute Report TR-89-063, Berkeley, Calif. (1989)), bump trees (Omohundro, Advances in Neural Information Processing Systems 3, Morgan Kaufmann, San Mateo, Calif. (1991)), gridding, and/or Voronoi tessellation (Sedgewick, Algorithms in C, Addison-Wesley, Princeton (1990)). The contents of all of the aforementioned publications are incorporated herein by reference.

[0182] The Generalized Regression Neural Network 900 can be trained in multiple phases using different optimization algorithms (i.e. Monte-Carlo Sampling 834, Simulated Annealing 836, Evolutionary Programming 838, and/or Genetic Algorithms 840), and/or different kernel parameters and number of nearest neighbors during each phase. For example, the Generalized Regression Neural Network 900 can be initially trained to perform feature detection using Simulated Annealing 836, ten nearest neighbors, a uniform kernel (i.e. the same kernel for all Training Cases), and a common scaling factor for all features. The resulting (partially optimized) network can then be further refined using Gradient Minimization 832 using fifty nearest neighbors, a uniform kernel, and a different scaling factor for each feature. Any number of phases and training schemes can be used as appropriate.

[0183] As in the functional models and multi-layer perceptrons described above, the Trainer 822 can be configured to retain more than one Generalized Regression Neural Network Model 842 during the training phase (flow arrow 890 in FIG. 8). For example, the Trainer 822 can be configured to retain the ten best Generalized Regression Neural Network Models 842 discovered during the training phase. Alternatively, the Trainer 822 can be configured to retain the ten best Generalized Regression Neural Network Models 842 discovered during training, which differ from each other by some predetermined amount. Again, the difference between two models can be defined ‘genotypically’ or ‘phenotypically’, i.e. by comparing the models based either on their internal structure, or their predictions.
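One way to retain the best models that differ from each other by a predetermined amount is sketched below, using a ‘phenotypic’ difference computed from the predictions of each model on a common set of cases. The root-mean-square difference, the tuple layout of `candidates`, and the function names are assumptions chosen for illustration.

```python
import math

def phenotypic_distance(preds_a, preds_b):
    """Root-mean-square difference between two prediction vectors."""
    total = sum((a - b) ** 2 for a, b in zip(preds_a, preds_b))
    return math.sqrt(total / len(preds_a))

def retain_best(candidates, max_models=10, min_difference=0.0):
    """Keep the lowest-error models that are mutually different.

    candidates is a list of (error, predictions, model) tuples. A model
    is kept only if its predictions differ from those of every model
    already kept by at least min_difference.
    """
    kept = []
    for error, preds, model in sorted(candidates, key=lambda c: c[0]):
        if all(phenotypic_distance(preds, p) >= min_difference
               for _, p, _ in kept):
            kept.append((error, preds, model))
        if len(kept) == max_models:
            break
    return kept
```

Setting `min_difference` to zero reproduces the simpler policy of retaining the ten best models regardless of how similar they are.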

[0184] c. Fuzzy Logic Models

[0185] The Structure-Property Model Generator 800 can generate Structure-Property Models 842 based on Fuzzy Logic 806. Fuzzy Logic was developed by Zadeh (Zadeh, Information and Control, 8: 338 (1965); Zadeh, Information and Control, 12: 94 (1968)) as a means of representing and manipulating data that is fuzzy rather than precise. The aforementioned publications are incorporated herein by reference in their entirety.

[0186] Central to the theory of Fuzzy Logic is the concept of a fuzzy set. In contrast to a traditional crisp set where an item either belongs to the set or does not belong to the set, fuzzy sets allow partial membership. That is, an item can belong to a fuzzy set to a degree that ranges from 0 to 1. A membership degree of 1 indicates complete membership, whereas a membership degree of 0 indicates non-membership. Any value between 0 and 1 indicates partial membership. Fuzzy sets can be used to construct rules for fuzzy expert systems and to perform fuzzy inference.

[0187] Usually, knowledge in a fuzzy system is expressed as rules of the form “if x is A, then y is B”, where x and y are fuzzy variables, and A and B are fuzzy values. Such fuzzy rules are stored in a fuzzy rule base or fuzzy knowledge base describing the system of interest. Fuzzy Logic 806 is the ability to reason (draw conclusions from facts or partial facts) using fuzzy sets, fuzzy rules, and fuzzy inference. Thus, following Yager's definition, a fuzzy model is a representation of the essential features of a system by the apparatus of fuzzy set theory (Yager and Filev, Essentials of Fuzzy Modeling and Control, Wiley (1994)). The aforementioned publication is incorporated herein by reference in its entirety.

[0188] Fuzzy Logic 806 has been employed to control complex or adaptive systems that defy exact mathematical modeling. Applications of fuzzy logic controllers range from cement-kiln process control, to robot control, image processing, motor control, camcorder auto-focusing, etc. However, to date, there has been no report on the use of Fuzzy Logic 806 for chemical structure-property prediction. A preferred embodiment of a Structure-Property Model Generator 800 using Fuzzy Logic 806 shall now be described in detail.

[0189] In one embodiment, the Structure-Property Model Generator 800 generates Fuzzy Structure-Property Models 842, i.e. models that represent the essential features of the system using the apparatus of fuzzy set theory. In particular, a Fuzzy Structure-Property Model 842 makes predictions using fuzzy rules from a fuzzy rule base describing the system of interest. A fuzzy rule is an IF-THEN rule with one or more antecedent and consequent variables. A fuzzy rule can be single-input-single-output (SISO), multiple-input-single-output (MISO), or multiple-input-multiple-output (MIMO). A fuzzy rule base is comprised of a collection of one or more such fuzzy rules. A MISO fuzzy rule base is of the form:

[0190] IF x1 is X11 AND x2 is X12 AND . . . AND xn is X1n THEN y is Y1

[0191] ALSO

[0192] IF x1 is X21 AND x2 is X22 AND . . . AND xn is X2n THEN y is Y2

[0193] ALSO

[0194] . . .

[0195] ALSO

[0196] IF x1 is Xr1 AND x2 is Xr2 AND . . . AND xn is Xrn THEN y is Yr,

[0197] where x1, . . . , xn are the input variables, y is the output (dependent) variable, and Xij, Yi, i=(1, . . . , r), j=(1, . . . , n) are fuzzy subsets of the universes of discourse of X1, . . . , Xn, and Y1, . . . , Yn, respectively. The fuzzy model described above is referred to as a linguistic model.

[0198] An example of a fuzzy structure-activity rule is:

[0199] IF molecular weight is high AND logP is low THEN activity is low

[0200] where ‘high’ and ‘low’ are fuzzy sets in the universe of discourse of molecular weight, logP, and activity.
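As an illustration of how the antecedent of such a rule can be evaluated, the following sketch computes the degree to which the rule applies to a given compound. The ramp-shaped membership functions, their breakpoints (300 to 500 for molecular weight, 1 to 3 for logP), and the use of the minimum operator for AND are all assumptions made for this example and are not prescribed by the rule itself.

```python
def ramp_up(x, lo, hi):
    """Membership that rises linearly from 0 at lo to 1 at hi."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)

def ramp_down(x, lo, hi):
    """Membership that falls linearly from 1 at lo to 0 at hi."""
    return 1.0 - ramp_up(x, lo, hi)

def mw_is_high(mw):
    # Hypothetical fuzzy set 'high' for molecular weight.
    return ramp_up(mw, 300.0, 500.0)

def logp_is_low(logp):
    # Hypothetical fuzzy set 'low' for logP.
    return ramp_down(logp, 1.0, 3.0)

def firing_strength(mw, logp):
    """Degree to which 'molecular weight is high AND logP is low' holds,
    with AND taken as the minimum of the two membership degrees."""
    return min(mw_is_high(mw), logp_is_low(logp))
```

A compound with a molecular weight of 400 and a logP of 2.0 belongs to each fuzzy set to degree 0.5, so the rule fires with strength 0.5 and concludes ‘activity is low’ to that degree.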

[0201] Alternatively, a Takagi-Sugeno-Kang (TSK) model can be used. A TSK fuzzy rule base is of the form:

[0202] IF x1 is X11 AND x2 is X12 AND . . . AND xn is X1n THEN y=b10+b11 x1+ . . . +b1n xn

[0203] ALSO

[0204] IF x1 is X21 AND x2 is X22 AND . . . AND xn is X2n THEN y=b20+b21 x1+ . . . +b2n xn

[0205] ALSO

[0206] . . .

[0207] ALSO

[0208] IF x1 is Xr1 AND x2 is Xr2 AND . . . AND xn is Xrn THEN y=br0+br1 x1+ . . . +brn xn

[0209] Thus, unlike a linguistic model that involves fuzzy consequents, a TSK model involves functional consequents, typically implemented as a linear function of the input variables.

[0210] Referring to FIG. 10, a Fuzzy Structure-Property Model 1000 is illustrated. In this embodiment, the Fuzzy Knowledge Base 1002 is comprised of a Rule Base 1004 and a Data Base 1006. The Data Base 1006 defines the membership functions of the fuzzy sets used as values for each system variable, while the Rule Base 1004 is a collection of fuzzy rules of the type described above. The system variables are of two main types: input variables and output variables.

[0211] In one embodiment, the input variables in a Fuzzy Structure-Activity Model 842 can be molecular features (descriptors). Such molecular features, which collectively represent the Encoded Structure Data 826, can include topological indices, physicochemical properties, electrostatic field parameters, volume and surface parameters, etc., and their number can range from a few tens to tens of thousands.

[0212] For example, these features can include, but are not limited to, molecular volume and surface areas, dipole moments, octanol-water partition coefficients, molar refractivities, heats of formation, total energies, ionization potentials, molecular connectivity indices, substructure keys, hashed fingerprints, atom pairs and/or topological torsions, atom layers, 2D and 3D auto-correlation vectors, 3D structural and/or pharmacophoric keys, electronic fields, etc.

[0213] If many features are used, Fuzzy Logic 806 can be combined with principal component analysis, factor analysis, and/or multi-dimensional scaling, which are well known techniques for reducing the dimensionality and extracting the most important features from a large table.

[0214] In one embodiment, the input variables (i.e. the Encoded Structure Data 826, which are usually crisp) are first converted into fuzzy sets by the Fuzzification Unit 1008 using the fuzzy set definitions in the Data Base 1006. Then, the Fuzzy Inference Module 1010 evaluates all the rules in the Rule Base 1004 to produce the output, using the method described below. In particular, the Fuzzy Inference Module 1010 performs the following steps:

[0215] (1) determines the degree of match between the fuzzified input data and the fuzzy sets defined for the input variables in the Data Base 1006;

[0216] (2) calculates the firing strength of each rule based on the degree of match of the fuzzy sets computed in step 1 and the connectives used in the antecedent part of the fuzzy rule (i.e. AND, OR, etc.); and

[0217] (3) derives the output based on the firing strength of each rule computed in step 2 and the fuzzy sets defined for the output variable in the Data Base 1006.

[0218] If the Fuzzy Structure-Property Model is a linguistic model, the fuzzy output of the Fuzzy Inference Module 1010 is finally defuzzified by the Defuzzification Unit 1012, using the output fuzzy set definitions in the Data Base 1006, and a defuzzification strategy such as the mean-of-maximum method. Alternatively, the center-of-area or any other suitable defuzzification method can be used.
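The three inference steps and the mean-of-maximum defuzzification can be sketched as follows for a linguistic model. Several choices here are assumptions and not requirements of the method: crisp inputs are matched directly against the input fuzzy sets, AND is taken as the minimum, each consequent set is clipped at the firing strength of its rule, the clipped sets are combined with the maximum, and the output universe is sampled on a discrete grid. All function names are hypothetical.

```python
def triangle(a, b, c):
    """Triangular membership function rising from a to b, falling to c."""
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        if x == b:
            return 1.0
        return (x - a) / (b - a) if x < b else (c - x) / (c - b)
    return mu

def infer(inputs, rules, output_grid):
    """Evaluate all rules and return the aggregated output fuzzy set.

    inputs maps a variable name to its crisp value. Each rule is a pair
    (antecedent, consequent), where antecedent maps variable names to
    membership functions and consequent is a membership function over
    the output variable.
    """
    aggregated = [0.0] * len(output_grid)
    for antecedent, consequent in rules:
        # Steps 1 and 2: degree of match per variable, combined with AND.
        strength = min(mu(inputs[name]) for name, mu in antecedent.items())
        # Step 3: clip the consequent set and merge it into the output.
        for idx, y in enumerate(output_grid):
            aggregated[idx] = max(aggregated[idx], min(strength, consequent(y)))
    return aggregated

def mean_of_maximum(output_grid, aggregated):
    """Return the mean of the grid points with maximal membership."""
    peak = max(aggregated)
    if peak == 0.0:
        return None  # no rule fired
    best = [y for y, m in zip(output_grid, aggregated) if m == peak]
    return sum(best) / len(best)
```

Replacing `mean_of_maximum` with a membership-weighted average over the grid would give a center-of-area style defuzzifier using the same aggregated set.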

[0219] Referring back to FIG. 8, the Trainer 822 of the Fuzzy Structure-Property Model Generator 800 preferably trains the Fuzzy Knowledge Base 1002 using Gradient Minimization 832, Monte-Carlo Sampling 834, Simulated Annealing 836, Evolutionary Programming 838, and/or a Genetic Algorithm 840, in order to minimize the overall prediction error for a prescribed set of Training Cases. The Trainer 822 can use a pre-existing Fuzzy Knowledge Base 1002 or may construct one directly from the Structure-Property Data 124, 128. Training is the process of creating, modifying and/or refining the fuzzy set definitions and fuzzy rules in the Fuzzy Knowledge Base 1002.

[0220] For example, in a preferred embodiment, the Fuzzy Knowledge Base 1002 is trained using a Monte-Carlo Sampling 834 or Simulated Annealing 836 algorithm. In this embodiment, an initial model is generated at random, by selecting a random set of rules and randomizing the membership functions associated with each input variable. The model is then gradually refined by a series of small stochastic ‘steps’. Here, the term ‘step’ is taken to imply a stochastic (random or semi-random) modification of the model's underlying structure.

[0221] For example, the model can be modified by inserting a new rule, by removing an existing rule, by modifying an existing rule (i.e. by inserting or removing a variable from the antecedent part of the fuzzy rule), by modifying the membership function of an existing fuzzy set, and/or by modifying the number of fuzzy partitions of a fuzzy variable (i.e. by increasing or decreasing the number of fuzzy partitions of the fuzzy variable). After the ‘step’ is performed, the quality of the resulting model is assessed, and the new model is compared to the old model using the Metropolis criterion. Alternatively, any other suitable comparison criterion can be used. If the new model is approved, it replaces the old model and the process is repeated. If the new model is not approved, the old model is retained as the current model, and the process is repeated.

[0222] This general process is controlled by a Monte-Carlo Sampling protocol 834, a Simulated Annealing protocol 836, or variants thereof, which are well known to people skilled in the art. However, it should be understood that the system of the present invention is not limited to these embodiments. Alternatively, the Fuzzy Knowledge Base 1002 can be trained using Evolutionary Programming 838, Genetic Algorithms 840, or any other suitable search/optimization algorithm. The implementation of these methods should be straightforward to persons skilled in the art.

[0223] As in the functional and neural network models described above, the Trainer 822 can be configured to retain more than one Fuzzy Structure-Property Model 842 during the training phase (flow arrow 890 in FIG. 8). For example, the Trainer 822 can be configured to retain the ten best Fuzzy Structure-Property Models 842 discovered during the training phase. Alternatively, the Trainer 822 can be configured to retain the ten best Fuzzy Structure-Property Models 842 discovered during training, which differ from each other by some predetermined amount. Again, the difference between two models can be defined ‘genotypically’ or ‘phenotypically’, i.e. by comparing the models based either on their internal structure, or their predictions.

[0224] d. Hybrid Models

[0225] The Structure-Property Model Generator 800 can generate Model Structures 820 that combine elements of Statistics 802, Neural Networks 804, Fuzzy Logic 806, and/or Model-Specific Methods 808. Such Model Structures 820 are referred to as Hybrid Model Structures, and the corresponding models are referred to as Hybrid Models. A preferred embodiment of such a Hybrid Model Structure 820 that combines elements of Neural Networks 804 and Fuzzy Logic 806 is referred to as a Neuro-Fuzzy Model Structure, and shall now be described in detail.

[0226] An example of such a Hybrid Model Structure 820 is a Model Structure that combines elements of Neural Networks 804 and Fuzzy Logic 806, hereafter referred to as a Neuro-Fuzzy Model Structure or Neuro-Fuzzy Model. An example of a Neuro-Fuzzy Model Structure is discussed in greater detail below.

[0227] i. Neuro-Fuzzy Models

[0228] A Neuro-Fuzzy Model Structure is a Model Structure 820 that combines the advantages of Fuzzy Logic 806 (e.g. human-like rule-based reasoning, ease of incorporating expert knowledge) and Neural Networks 804 (e.g. learning ability, optimization ability, and connectionist structure). On the neural side, more transparency is obtained by pre-structuring a neural network to improve its performance, or by interpreting the weight matrix that results from training. On the fuzzy side, the parameters that control the performance of a fuzzy model can be tuned using techniques similar to those used in neural network systems. Thus, neural networks can improve their transparency, making them closer to fuzzy systems, while fuzzy systems can self-adapt, making them closer to neural networks.

[0229] Neuro-Fuzzy systems can be of three main types:

[0230] (1) neural fuzzy systems that use neural networks as tools in fuzzy models;

[0231] (2) fuzzy neural networks that fuzzify conventional neural networks; and

[0232] (3) Neuro-Fuzzy hybrid systems that incorporate neural networks and fuzzy systems into hybrid systems.

[0233] Neuro-Fuzzy modeling is reviewed in Lin and Lee, Neural Fuzzy Systems, Prentice-Hall (1996), incorporated herein by reference in its entirety.

[0234] One embodiment of a Neuro-Fuzzy Structure-Property Model is a Neural Fuzzy Model with Fuzzy Singleton Rules described in Nomura et al., Proc. IEEE Int. Conf. Fuzzy Syst., 1320, San Diego (1992), incorporated herein by reference in its entirety. The structure of a Neural Fuzzy Model with Fuzzy Singleton Rules 1100 is shown in FIG. 11. Fuzzy singleton rules are of the form:

[0235] IF x1 is X11 AND x2 is X12 AND . . . AND xn is X1n THEN y=w1,

[0236] where x1, . . . , xn are the input variables, y is the output (dependent) variable, Xij, i=(1, . . . , r), j=(1, . . . , n) are fuzzy subsets of the universes of discourse of X1, . . . , Xn with fuzzy membership functions μXij(xj), and wi is a real number of the consequent part. If product inference and a centroid defuzzifier are used, the output y of such a Neuro-Fuzzy Structure-Property Model 1100 is computed by EQ. 1:

$\begin{matrix}{y = \frac{\sum\limits_{i = 1}^{r}{\mu_{i}w_{i}}}{\sum\limits_{i = 1}^{r}\mu_{i}}} & \text{EQ.~~1}\end{matrix}$

[0237] where:

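$\begin{matrix}{\mu_{i} = {\mu_{X_{i1}}\left( x_{1} \right)}\,{\mu_{X_{i2}}\left( x_{2} \right)}\cdots{\mu_{X_{in}}\left( x_{n} \right)}} & \text{EQ.~~2}\end{matrix}$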

[0238] Alternatively, the output y can be computed by EQ. 3:

$\begin{matrix}{y = {\sum\limits_{i = 1}^{r}{\mu_{i}w_{i}}}} & \text{EQ.~~3}\end{matrix}$
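EQ. 1 through EQ. 3 can be evaluated with a few lines of code. The sketch below assumes Gaussian membership functions and represents each singleton rule as a pair of (list of membership functions, consequent value); these representations and the function names are illustrative assumptions only.

```python
import math

def gaussian(center, width):
    """Gaussian membership function with the given center and width."""
    return lambda x: math.exp(-((x - center) ** 2) / (2.0 * width ** 2))

def rule_strengths(x, rules):
    """EQ. 2: product of the antecedent membership degrees of each rule."""
    return [math.prod(mu(xj) for mu, xj in zip(memberships, x))
            for memberships, _ in rules]

def predict(x, rules, normalize=True):
    """Output of a singleton-rule model for the input vector x.

    With normalize=True the strength-weighted average of EQ. 1 is
    returned; with normalize=False the plain weighted sum of EQ. 3.
    """
    strengths = rule_strengths(x, rules)
    weighted = sum(m * w for m, (_, w) in zip(strengths, rules))
    if not normalize:
        return weighted
    total = sum(strengths)
    return weighted / total if total > 0 else 0.0
```

For two rules centered at 0 and 2 with consequents 1 and 3, an input of 1 activates both rules equally, so the normalized output of EQ. 1 is the midpoint 2.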

[0239] Referring back to FIG. 8, the Trainer 822 of the Neuro-Fuzzy Structure-Property Model Generator 800 preferably trains (i.e. constructs and/or refines) the Neuro-Fuzzy Structure-Property Model Structure 820 using Gradient Minimization 832, Monte-Carlo Sampling 834, Simulated Annealing 836, Evolutionary Programming 838, and/or a Genetic Algorithm 840, in order to minimize the overall prediction error for a prescribed set of Training Cases. The Trainer 822 can use a pre-existing Neuro-Fuzzy Structure-Property Model 842 or can construct a new one directly from the Structure-Property Data 124, 128. In the preferred embodiment described above (i.e. if the Neuro-Fuzzy Structure-Property Model Structure is a Neural Fuzzy Model with Fuzzy Singleton Rules), training is the process of constructing and/or refining the rules, membership functions μXij(xj), and/or the real numbers wi. As in traditional fuzzy systems, the membership functions can be Gaussians, triangular functions, or trapezoidal functions. Alternatively, any other suitable functional form can be used.

[0240] An example of a training procedure for a Neural Fuzzy Model with Fuzzy Singleton Rules based on Gradient Minimization 832 is given in Nomura et al., and Lin and Lee, supra. However, the present invention is not limited to this embodiment. Alternatively, the Trainer 822 can train the Neuro-Fuzzy Structure-Property Model Structure 820 using Gradient Minimization 832, Monte-Carlo Sampling 834, Simulated Annealing 836, Evolutionary Programming 838, and/or a Genetic Algorithm 840. Each of these methods requires a suitable encoding of the free parameters of the model, and their implementation should be straightforward to persons skilled in the art.

[0241] Again, the Trainer 822 can be configured to retain more than one Neuro-Fuzzy Structure-Property Model 842 during the training phase (flow arrow 890 in FIG. 8). For example, the Trainer 822 can be configured to retain the ten best Neuro-Fuzzy Structure-Property Models 842 discovered during the training phase. Alternatively, the Trainer 822 can be configured to retain the ten best Neuro-Fuzzy Structure-Property Models 842 discovered during training, which differ from each other by some predetermined amount. Again, the difference between two models can be defined ‘genotypically’ or ‘phenotypically’, i.e. by comparing the models based either on their internal structure, or their predictions.

[0242] e. Model-Specific Methods

[0243] The Structure-Property Model Generator 800 can generate Structure-Property Models 842 based on Model-Specific Methods 808. Model-Specific Methods 808 refer to methods that are specific to the application domain of the model. Examples of such Model-Specific Methods 808 are methods that compute the energy of a particular molecular conformation or receptor-ligand complex such as an empirical force field or a quantum-mechanical method, methods that align the 3-dimensional structures of two or more chemical compounds based on their shape, electronic fields and/or other criteria, methods that predict the affinity and binding conformation of a ligand to a particular receptor binding site, methods that construct receptor models based on the 3-dimensional structures of known ligands, etc. Examples of such Model-Specific Methods 808 are described in greater detail below.

[0244] Model-Specific Methods 808 can include methods that take into account the 3-dimensional structures of the chemical compounds and/or their biological targets. Such methods are of two main types: docking methods and 3D QSAR methods. Examples of such methods that can be used shall now be described.

[0245] i. Docking Methods

[0246] Docking methods are methods that attempt to predict the binding conformation between a ligand and a receptor based on their 3-dimensional fit, and/or provide an absolute or relative measure of the affinity of a particular ligand for a particular receptor, based on the quality of their 3-dimensional fit. Docking methods require a 3-dimensional model of the receptor (or parts of the receptor), which can be determined directly through X-ray crystallography, nuclear magnetic resonance, or some other 3D structure-determination technique, or indirectly through homology modeling based on the 3-dimensional structure of a related receptor, for example.

[0247] Most docking methods reported to date are static in nature. That is, a suitable energy function is derived based on an analysis of the 3-dimensional structures of known receptor-ligand complexes, and that energy function is subsequently used to evaluate the energy of a particular receptor-ligand binding conformation. The terms ‘energy’ and ‘energy function’ are used herein to denote any numerical method for evaluating the quality of the interaction between a ligand and a receptor at a particular binding conformation. Such energy functions are usually combined with a search/optimization method such as Gradient Minimization 832, Monte-Carlo Sampling 834, Simulated Annealing 836, Evolutionary Programming 838, and/or a Genetic Algorithm 840, to identify one or more low energy binding conformations, and to predict the affinity of a particular ligand for a particular receptor.

[0248] Docking methods are reviewed in Lybrand, Curr. Opin. Struct. Biol. (April 1995), Shoichet et al., Chem. Biol. (March 1996), Lengauer et al., Curr. Opin. Struct. Biol. (June 1996), Willett, Trends Biotechnol. (1995), and Jackson, Curr. Opin. Biotechnol. (December 1995), incorporated herein by reference in their entirety.

[0249] A docking method can be used to derive 3-dimensional structural models of ligands bound to a particular receptor(s), and/or to obtain estimates of the binding affinity of ligands for a particular receptor(s). In a preferred embodiment, the Analysis Module 118 determines the 3-dimensional structures of selected receptor-ligand complexes from the Directed Diversity Library 108. Preferably, the 3-dimensional structures of the complexes are determined using X-ray crystallography, nuclear magnetic resonance, or some other suitable 3D structure-determination technique.

[0250] It is not necessary that every compound in the Directed Diversity Library 108 be analyzed by the Analysis Module 118 to derive a 3-dimensional receptor map. It should be understood that it is possible that none of the compounds in a given Directed Diversity Library 108 or a sequence of Directed Diversity Libraries 108 will be analyzed by the Analysis Module 118 to obtain a 3-dimensional receptor map. It is also possible that every compound in the Directed Diversity Library 108 is analyzed by the Analysis Module 118 to derive a 3-dimensional receptor map. The determination as to which compounds from the Directed Diversity Library 108 will actually be analyzed by the Analysis Module 118 to derive a 3-dimensional receptor map can be made manually (as specified by operator input, for example) or automatically by the Directed Diversity Manager 310.

[0251] In one embodiment, the 3D Receptor Map Data 522 (FIG. 5) generated by the 3D Receptor Mapping Module 418 is used by the Trainer 822 to train (i.e. construct and/or refine) the energy function that is used by the docking method to evaluate the energy of a particular receptor-ligand binding conformation. The training of the energy function is carried out using Gradient Minimization 832, Monte-Carlo Sampling 834, Simulated Annealing 836, Evolutionary Programming 838, and/or a Genetic Algorithm 840, so that the prediction error for a prescribed Training Set of 3D Receptor Map Data 522 is minimized. The prediction error is specified based on the difference between the actual and predicted 3-dimensional structures of the receptor-ligand complexes in the Training Set (such as the RMSD criterion, for example), and/or based on the difference between the actual and predicted affinities of the receptor-ligand complexes in the Training Set. Several energy functions and several methods for training such energy functions have been reported, and their implementation should be straightforward to persons skilled in the art.
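As an illustration of the RMSD criterion mentioned above, the following sketch computes the root-mean-square deviation between the actual and predicted coordinates of a ligand. It assumes that the two coordinate lists contain the same atoms in the same order and are already expressed in a common frame of reference (no superposition is performed); the function name `rmsd` is hypothetical.

```python
import math

def rmsd(actual, predicted):
    """Root-mean-square deviation between two matched coordinate lists.

    actual and predicted are equal-length lists of (x, y, z) tuples,
    with atom i of one list corresponding to atom i of the other.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = sum(math.dist(a, b) ** 2 for a, b in zip(actual, predicted))
    return math.sqrt(total / len(actual))
```

A Trainer could average this value over all complexes in the Training Set to obtain a single structural prediction error for a candidate energy function.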

[0252] ii. 3D QSAR Methods

[0253] The Structure-Property Model Generator 800 can also be used to generate one or more 3D QSAR models. 3D QSAR models are models that are based on an analysis of the 3-dimensional structures of a series of ligands whose biological activities/properties are known. Unlike docking methods, however, 3D QSAR methods do not require knowledge of the 3-dimensional structure of the receptor or receptor-ligand complex. 3D QSAR methods are reviewed in Kubinyi (Ed.), 3D QSAR in Drug Design, ESCOM, Leiden (1993), incorporated herein by reference in its entirety.

[0254] In one embodiment, the Structure-Property Model Generator 800 generates Structure-Property Models 842 based on one or more 3D QSAR methods. Such 3D QSAR methods include, but are not limited to, pharmacophore identification, structural alignment and molecular superposition, molecular shape analysis, mini-receptors and pseudo-receptors, distance geometry, hypothetical active site lattice, and/or molecular interaction fields. Alternatively, any other suitable 3D QSAR method can be used.

[0255] Referring back to FIG. 8, a 3D QSAR Model Structure 820 can be trained to minimize the prediction error using Gradient Minimization 832, Monte-Carlo Sampling 834, Simulated Annealing 836, Evolutionary Programming 838, and/or a Genetic Algorithm 840. Alternatively, the 3D QSAR Model Structure 820 can be trained to minimize the prediction error using a combination of Gradient Minimization 832, Monte-Carlo Sampling 834, Simulated Annealing 836, Evolutionary Programming 838, and/or Genetic Algorithms 840. The training process involves adjusting the free parameters of the 3D QSAR Structure-Property Model Structure 820 to minimize the mean prediction error (or some other suitable error criterion) for a Training Set of Structure-Property Data 124, 128 within some prescribed tolerance. The implementation of such methods should be straightforward to persons skilled in the art.

[0256] As in the functional, Neural Network, Fuzzy, and Neuro-Fuzzy models described above, the Trainer 822 can be configured to retain more than one 3D QSAR Model 842 during the training phase (flow arrow 890 in FIG. 8). For example, the Trainer 822 can be configured to retain the ten best 3D QSAR Models 842 discovered during the training phase. Alternatively, the Trainer 822 can be configured to retain the ten best 3D QSAR Models 842 discovered during training, which differ from each other by some predetermined amount. Again, the difference between two models can be defined ‘genotypically’ or ‘phenotypically’, i.e. by comparing the models based either on their internal structure, or their predictions.

[0257] 4. Experiment Planner 130

[0258] a. Selection Criteria 104

[0259] The Experiment Planner 130 can define two general types of Selection Criteria 104. The first type of Selection Criteria 104 represents functions or algorithms that receive a compound and/or a list of compounds from the Compound Library 102, and that return a numerical value that represents an individual or collective property of these compounds. The second type of Selection Criteria 104 represents specific constraints and/or methods for generating such lists of compounds. Both types of Selection Criteria 104 are discussed below.

[0260] i. First Type of Selection Criteria 104

[0261] The first type of Selection Criteria 104 represents functions or algorithms that receive a compound and/or a list of compounds from the Compound Library 102, and return a numerical value that represents an individual or collective property of these compounds. Examples of such Selection Criteria 104 that can be used in a preferred embodiment shall now be described. However, it should be understood that the present invention is not limited to this embodiment, and that other suitable Selection Criteria 104 can also be used.

[0262] One such Selection Criterion 104 (referred to hereafter as a Compound Availability Criterion) receives as input a list of compounds from the Compound Library 102, and returns the number or fraction of these compounds that are available from the Chemical Inventory 110.

[0263] Another such Selection Criterion 104 (referred to hereafter as a Reagent Count Criterion) receives as input a list of compounds from the Compound Library 102, and returns the number of Reagents 114 that must be mixed together in the Synthesis Module 112 in order to synthesize these compounds according to a prescribed synthetic scheme.

[0264] Another such Selection Criterion 104 (referred to hereafter as a Reagent Availability Criterion) receives as input a list of compounds from the Compound Library 102, identifies which Reagents 114 must be mixed together in the Synthesis Module 112 in order to synthesize these compounds according to a prescribed synthetic scheme, and returns the number or fraction of these Reagents 114 that are available from the Reagent Inventory 116.

[0265] Another such Selection Criterion 104 (referred to hereafter as a Reagent Cost Criterion) receives as input a list of compounds from the Compound Library 102, identifies which Reagents 114 must be mixed together in the Synthesis Module 112 in order to synthesize these compounds according to a prescribed synthetic scheme, identifies which of these Reagents 114 need to be purchased from an external source, and returns the cost of purchasing these Reagents 114 from such an external source.
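The four criteria described so far can each be expressed as a function that receives a list of compounds and returns a number. The sketch below is illustrative only and rests on several assumptions: compounds and reagents are represented by identifiers, the synthetic scheme is a mapping from each compound to the set of reagents it requires, the inventories are sets of identifiers, the Reagent Count Criterion counts distinct reagents, and `price_list` maps each purchasable reagent to its cost. All names are hypothetical.

```python
def compound_availability(compounds, chemical_inventory):
    """Fraction of the listed compounds already held in the inventory."""
    if not compounds:
        return 0.0
    held = sum(1 for c in compounds if c in chemical_inventory)
    return held / len(compounds)

def required_reagents(compounds, synthetic_scheme):
    """Set of distinct reagents needed to synthesize all listed compounds."""
    reagents = set()
    for c in compounds:
        reagents.update(synthetic_scheme[c])
    return reagents

def reagent_count(compounds, synthetic_scheme):
    """Number of distinct reagents required by the listed compounds."""
    return len(required_reagents(compounds, synthetic_scheme))

def reagent_availability(compounds, synthetic_scheme, reagent_inventory):
    """Fraction of the required reagents held in the reagent inventory."""
    needed = required_reagents(compounds, synthetic_scheme)
    if not needed:
        return 0.0
    return len(needed & reagent_inventory) / len(needed)

def reagent_cost(compounds, synthetic_scheme, reagent_inventory, price_list):
    """Cost of purchasing the required reagents that are not in stock."""
    needed = required_reagents(compounds, synthetic_scheme)
    return sum(price_list[r] for r in needed - reagent_inventory)
```

Because each function maps a compound list to a single number, any of them can be used directly as a term of an Objective Function evaluated by a Selector.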

[0266] Another such Selection Criterion 104 (referred to hereafter as a Molecular Diversity Criterion) receives as input a list of compounds from the Compound Library 102, and returns a numerical value that represents the molecular diversity of these compounds. Molecular diversity refers to the ability of a given set of compounds to exhibit a variety of prescribed structural, physical, chemical and/or biological characteristics. The field of molecular diversity is reviewed in Martin et al., Reviews in Computational Chemistry, Vol. 10, VCH, Weinheim (1997), and Agrafiotis, Encyclopedia of Computational Chemistry, Wiley (in press), incorporated herein by reference in their entirety.

[0267] Molecular diversity is a collective property, and is usuallydefined in a prescribed ‘chemical space’, i.e. in a space defined by aprescribed set of molecular properties or characteristics. Consequently,a diverse collection of compounds in one definition of chemical spacemay not necessarily be diverse in another definition of chemical space.

[0268] A number of methods and algorithms to extract a diverse subset ofcompounds from a larger collection have been reported. Such algorithmsinclude clustering, maximin, stepwise elimination, cluster sampling,d-optimal design, etc. Most of these methods are ‘greedy’ methods thatselect compounds in an incremental manner. The system of the presentinvention represents molecular diversity as a Selection Criterion 104,i.e. as a function or algorithm that receives as input a list ofcompounds, and returns a numerical value that represents the moleculardiversity of these compounds. Moreover, the Diversity Criterion can beused as part of an Objective Function that is used by the Selector 106to select a Directed Diversity Library 108 for the next iteration.

[0269] A preferred embodiment of a Diversity Criterion is given by EQ. 4: $D(S) = \frac{\sum_{i=1}^{n} \min_{j \neq i} d_{ij}}{n(n-1)/2} \qquad \text{(EQ. 4)}$

[0270] where S is a set of compounds, D(S) is the diversity of the compounds in S, n is the number of compounds in S, i, j are used to index the elements of S, and dij is the distance between the i-th and j-th compounds in S. In a preferred embodiment, the distance dij is a Minkowski metric (e.g. Manhattan distance, Euclidean distance, ultrametric distance, etc.) in a multivariate property space. Preferably, the property space is defined using one or more molecular features (descriptors). Such molecular features can include topological indices, physicochemical properties, electrostatic field parameters, volume and surface parameters, etc. For example, these features can include, but are not limited to, molecular volume and surface areas, dipole moments, octanol-water partition coefficients, molar refractivities, heats of formation, total energies, ionization potentials, molecular connectivity indices, substructure keys, hashed fingerprints, atom pairs and/or topological torsions, atom layers, 2D and 3D auto-correlation vectors, 3D structural and/or pharmacophoric keys, electronic fields, etc. Alternatively, the Hamming distance: $d_{ij} = \frac{\left| \mathrm{XOR}(x_i, x_j) \right|}{k} \qquad \text{(EQ. 5)}$

[0271] Tanimoto coefficient: $d_{ij} = \frac{\left| \mathrm{AND}(x_i, x_j) \right|}{\left| \mathrm{IOR}(x_i, x_j) \right|} \qquad \text{(EQ. 6)}$

[0272] or Dice coefficient: $d_{ij} = \frac{2 \left| \mathrm{AND}(x_i, x_j) \right|}{\left| x_i \right| + \left| x_j \right|} \qquad \text{(EQ. 7)}$

[0273] can be used. In EQ. 5-7, xi and xj represent binary strings encoding the i-th and j-th structures, respectively (e.g. a substructure key, pharmacophore key, or hashed fingerprint), k is the length of the binary sets xi and xj, AND(xi, xj), IOR(xi, xj) and XOR(xi, xj) are the binary intersection, union (‘inclusive or’) and ‘exclusive or’ of xi and xj, respectively, and |xi| is the number of bits that are ‘on’ in xi. However, the present invention is not limited to these embodiments, and any suitable distance measure and/or definition of chemical space can alternatively be used.
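The following is a minimal sketch of EQ. 4 together with the binary measures of EQ. 5-7, written under the assumption that each fingerprint is stored as a Python integer whose bits are the ‘on’/‘off’ positions. The function names and the example fingerprints are hypothetical. The Tanimoto and Dice expressions are transcribed as written in EQ. 6 and EQ. 7, so they grow with bit overlap; using a complement such as one minus the coefficient as a dissimilarity would be an additional assumption that the text does not state.

```python
def popcount(x):
    """Number of bits that are 'on' in the integer-encoded fingerprint x."""
    return bin(x).count("1")


def hamming(xi, xj, k):
    """EQ. 5: fraction of the k bit positions in which xi and xj differ."""
    return popcount(xi ^ xj) / k


def tanimoto(xi, xj):
    """EQ. 6: |AND| / |IOR|, returned as 0.0 when both strings are empty."""
    union = popcount(xi | xj)
    return popcount(xi & xj) / union if union else 0.0


def dice(xi, xj):
    """EQ. 7: 2 |AND| / (|xi| + |xj|), returned as 0.0 for empty strings."""
    total = popcount(xi) + popcount(xj)
    return 2 * popcount(xi & xj) / total if total else 0.0


def diversity(S, dist):
    """EQ. 4: sum of nearest-neighbour distances divided by n(n-1)/2."""
    n = len(S)
    if n < 2:
        return 0.0
    total = sum(
        min(dist(S[i], S[j]) for j in range(n) if j != i) for i in range(n)
    )
    return total / (n * (n - 1) / 2)


# Three hypothetical 4-bit fingerprints.
fingerprints = [0b1100, 0b1010, 0b0011]
d_value = diversity(fingerprints, lambda a, b: hamming(a, b, 4))
```

The double loop in `diversity` makes the quadratic cost of a direct evaluation explicit.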

[0274] EQ. 4 exhibits quadratic time complexity, i.e. the time required to compute D(S) scales with the square of the number of compounds in the set S. To remedy this problem, in a preferred embodiment, the method can be combined with a nearest neighbor algorithm such as a k-d tree (Bentley, Comm. ACM, 18(9): 509 (1975), Friedman et al., ACM Trans. Math. Soft., 3(3): 209 (1977)), each incorporated herein by reference in its entirety. Alternatively, any other suitable algorithm can be used, including, but not limited to:

[0275] (1) ball trees (Omohundro, International Computer Science Institute Report TR-89-063, Berkeley, Calif. (1989)), incorporated herein by reference in its entirety;

[0276] (2) bump trees (Omohundro, Advances in Neural Information Processing Systems 3, Morgan Kaufmann, San Mateo, Calif. (1991)), incorporated herein by reference in its entirety; and

[0277] (3) gridding, and Voronoi tessellation (Sedgewick, Algorithms in C, Addison-Wesley, Princeton (1990)), incorporated herein by reference in its entirety.
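As an illustration of how a nearest neighbor structure can replace the pairwise scan in EQ. 4, the sketch below builds a simple k-d tree in pure Python and uses it to look up each compound's nearest neighbor under the Euclidean distance. It is a simplified, unbalanced-on-update, recursive variant written for this example, not the algorithm of the cited references, and the function names are hypothetical. Compounds are assumed to be tuples of descriptor values.

```python
import math


def build_kdtree(points, depth=0):
    """Build a k-d tree from (index, coords) pairs as nested tuples."""
    if not points:
        return None
    axis = depth % len(points[0][1])
    points = sorted(points, key=lambda p: p[1][axis])
    mid = len(points) // 2
    return (
        points[mid],
        axis,
        build_kdtree(points[:mid], depth + 1),
        build_kdtree(points[mid + 1:], depth + 1),
    )


def nearest(node, target, skip, best=(math.inf, None)):
    """Return (distance, index) of the point closest to target, ignoring skip."""
    if node is None:
        return best
    (idx, coords), axis, left, right = node
    if idx != skip:
        d = math.dist(coords, target)
        if d < best[0]:
            best = (d, idx)
    diff = target[axis] - coords[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, target, skip, best)
    # Visit the far side only if the splitting plane is closer than the best hit.
    if abs(diff) < best[0]:
        best = nearest(far, target, skip, best)
    return best


def diversity_kdtree(coords):
    """EQ. 4 with nearest-neighbour distances obtained from the k-d tree."""
    n = len(coords)
    if n < 2:
        return 0.0
    tree = build_kdtree(list(enumerate(coords)))
    total = sum(nearest(tree, c, i)[0] for i, c in enumerate(coords))
    return total / (n * (n - 1) / 2)


example = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
example_diversity = diversity_kdtree(example)
```

In low-dimensional property spaces, a query of this kind usually inspects only a small part of the tree, which is the motivation for pairing EQ. 4 with such a structure. The benefit diminishes as the number of descriptors grows.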

[0278] Another such Selection Criterion 104 (referred to hereafter as a Molecular Similarity Criterion) receives as input a list of compounds from the Compound Library 102 and a list of reference compounds, and returns a numerical value that represents the molecular similarity of these compounds to the reference compounds. In a preferred embodiment, the similarity of a list of compounds to a prescribed set of reference compounds is computed using EQ. 8: $M(S, L) = \frac{\sum_{i=1}^{n} \min_{j=1}^{k} d_{ij}}{n} \qquad \text{(EQ. 8)}$

[0279] where S is a set of compounds, L is a set of reference compounds, M(S, L) is the measure of similarity of the compounds in S to the compounds in L, n is the number of compounds in S, k is the number of compounds in L, i and j are used to index the elements of S and L, respectively, and dij is the distance between the i-th compound in S and the j-th compound in L. Thus, EQ. 8 represents the mean distance of a compound in S from its nearest reference compound in L. In a preferred embodiment, the distance dij is a Minkowski metric (e.g. Manhattan distance, Euclidean distance, ultrametric distance, etc.) in a multivariate property space. Preferably, the property space is defined using one or more molecular features (descriptors). Such molecular features can include topological indices, physicochemical properties, electrostatic field parameters, volume and surface parameters, etc. For example, these features can include, but are not limited to, molecular volume and surface areas, dipole moments, octanol-water partition coefficients, molar refractivities, heats of formation, total energies, ionization potentials, molecular connectivity indices, substructure keys, hashed fingerprints, atom pairs and/or topological torsions, atom layers, 2D and 3D auto-correlation vectors, 3D structural and/or pharmacophoric keys, electronic fields, etc. Alternatively, the distance dij can be computed by the Hamming (EQ. 5), Tanimoto (EQ. 6), or Dice coefficients (EQ. 7) using a binary molecular representation, such as a substructure key, pharmacophore key, or hashed fingerprint, for example. However, the present invention is not limited to these embodiments, and any suitable definition of chemical space, distance measure, and/or Similarity Criterion can alternatively be used.

[0280] The set of reference compounds may or may not represent real or synthesizable compounds. For example, the set of reference compounds can represent an ‘ideal’ or ‘target’ set of properties that the selected compounds should possess. In this case, the Similarity Criterion in EQ. 8 (or any other suitable Similarity Criterion) measures how well a particular set of compounds matches a prescribed set of target properties.

[0281] The Similarity Criterion can be used to design a set of compounds close to a reference set of compounds, or to design a set of compounds far from a reference set of compounds. For example, if EQ. 8 is used, this can be achieved by simply reversing the sign of M(S, L).
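A minimal sketch of EQ. 8 follows, assuming compounds and reference compounds are represented as tuples of descriptor values and that the Euclidean distance is used. The function name and the example coordinates are hypothetical.

```python
import math


def similarity(S, L, dist=math.dist):
    """EQ. 8: mean distance from each compound in S to its nearest reference in L."""
    if not S or not L:
        raise ValueError("both the compound set and the reference set must be non-empty")
    return sum(min(dist(s, ref) for ref in L) for s in S) / len(S)


# Hypothetical two-descriptor example.
S = [(0.0, 0.0), (3.0, 4.0)]
L = [(0.0, 1.0), (3.0, 0.0)]

m_value = similarity(S, L)   # small values mean S lies close to the references
far_from_refs = -m_value     # reversed sign, for designing sets far from L
```

Because M(S, L) is a mean distance, an optimizer that maximizes its objective would use the negated value to pull a selection toward the reference set, and the positive value to push it away.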

[0282] Another Selection Criterion 104 (referred to hereafter as a Synthetic Confidence Criterion) receives as input a compound (or list of compounds) from the Compound Library 102, and returns a confidence factor that this compound can be synthesized by the Synthesis Module 112 using a prescribed synthetic scheme. For example, this confidence factor can be computed by an expert system for computer-assisted organic synthesis. However, it should be understood that the present invention is not limited to this embodiment.

[0283] Another such Selection Criterion 104 (referred to hereafter as a Synthetic Yield Criterion) receives as input a compound (or list of compounds) from the Compound Library 102, and returns a predicted yield for the compound(s), if the compound(s) were to be synthesized by the Synthesis Module 112 according to a prescribed synthetic scheme. For example, the synthetic yield can be computed by an expert system for computer-assisted organic synthesis. However, it should be understood that the present invention is not limited to this embodiment.

[0284] Another such Selection Criterion 104 (referred to hereafter as a Synthetic Ease or Synthetic Planning Criterion) receives as input a list of compounds from the Compound Library 102, and returns a numerical value that represents the ease of planning and executing the synthesis of these compounds in the Synthesis Module 112 according to a prescribed synthetic scheme. For example, one such Synthetic Planning Criterion can be a value indicating if (and by how much) a particular collection of compounds exceeds the synthetic capacity of an automated robotic Synthesis Module 112. Another example of such a Synthetic Planning Criterion may be the number of different synthetic schemes that must be executed by the Synthesis Module 112 in order to synthesize a particular collection of compounds. However, it should be understood that the present invention is not limited to these embodiments.

[0285] Another such Selection Criterion 104 (referred to hereafter as a Structure-Property Model Confirmatory Criterion) receives as input a list of compounds from the Compound Library 102 and a Structure-Property Model 842, and returns the mean predicted property (or activity) of these compounds, as inferred by the specified model. Alternatively, any other suitable numerical value that can be derived from the predicted properties of the specified compounds as inferred by the specified Structure-Property Model can be used. For example, the Structure-Property Model Confirmatory Criterion can return the minimum property, maximum property, or deviation of properties of the specified list of compounds, as inferred by the specified Structure-Property Model. However, it should be understood that the present invention is not limited to these embodiments. Any form of a Structure-Property Model 842 can be used in this regard. For example, the Structure-Property Models 842 can include models derived from Statistics 802, Neural Networks 804, Fuzzy Logic 806, and/or Model-Specific Methods 808, and/or models derived from a combination of Statistics 802, Neural Networks 804, Fuzzy Logic 806, and/or Model-Specific Methods 808, such as the Neuro-Fuzzy Structure Property Model 1100 described above, for example. These Structure-Property Models 842 can also include models derived from docking methods and/or 3D QSAR methods including, but not limited to, pharmacophore identification, structural alignment and molecular superposition, molecular shape analysis, mini-receptors and pseudo-receptors, distance geometry, hypothetical active site lattice, and/or molecular interaction fields. However, it should be understood that the present invention is not limited to these embodiments.

[0286] Another such Selection Criterion 104 (referred to hereafter as a Structure-Property Model Discriminatory Criterion) receives as input a compound (or list of compounds) from the Compound Library 102 and two or more Structure-Property Models 842, and returns a numerical value that represents the ability (or collective ability) of this compound (or list of compounds) to discriminate between the specified models. The term ‘discriminate’ is used herein to denote the ability of a compound (or list of compounds) to distinguish between two or more models. A compound is said to possess high discriminatory ability if the models differ substantially in their predictions of the properties of that compound. Structure-Property Model Discriminatory Criteria 104 can be used if the Structure-Property Models 842 are weak or under-determined, for example. In such cases, it is often difficult to select which Structure-Property Model(s) 842 should be used to select the Directed Diversity Library 108 for the next iteration. Thus, it may be desirable to select compounds that can discriminate between two or more Structure-Property Models 842, so that the Structure-Property Models 842 that reflect true correlations are reinforced, while the Structure-Property Models 842 that do not reflect true correlations are eliminated. An example of a Structure-Property Model Discriminatory Criterion is the difference between the minimum and maximum property predictions for a given compound as inferred by the specified Structure-Property Models 842, or the deviation of the property predictions for a given compound as inferred by the specified Structure-Property Models 842. However, it should be understood that the present invention is not limited to these embodiments.
As with Structure-Property Model Confirmatory Criteria 104, any form of a Structure-Property Model 842 can be used in this regard. For example, the Structure-Property Models 842 can include models derived from Statistics 802, Neural Networks 804, Fuzzy Logic 806, and/or Model-Specific Methods 808, and/or models derived from a combination of Statistics 802, Neural Networks 804, Fuzzy Logic 806, and/or Model-Specific Methods 808, such as the Neuro-Fuzzy Structure Property Model 1100 described above, for example. These Structure-Property Models 842 can also include models derived from docking methods and/or 3D QSAR methods including, but not limited to, pharmacophore identification, structural alignment and molecular superposition, molecular shape analysis, mini-receptors and pseudo-receptors, distance geometry, hypothetical active site lattice, and/or molecular interaction fields. However, it should be understood that the present invention is not limited to these embodiments.
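The Confirmatory and Discriminatory Criteria above can be sketched as follows, under the assumption that each Structure-Property Model is exposed as a Python callable that maps a compound (here a dictionary of descriptor values) to a predicted property. The three toy models and the descriptor names are hypothetical and stand in for whatever models are actually available. The population standard deviation is used as one possible reading of ‘deviation’.

```python
import statistics


def mean_prediction(compounds, model):
    """Confirmatory example: mean predicted property of a list of compounds."""
    return statistics.fmean(model(c) for c in compounds)


def prediction_range(compound, models):
    """Discriminatory example: max minus min prediction across the models."""
    predictions = [m(compound) for m in models]
    return max(predictions) - min(predictions)


def prediction_deviation(compound, models):
    """Discriminatory example: population standard deviation across the models."""
    return statistics.pstdev([m(compound) for m in models])


# Three hypothetical models that disagree on the same compound.
models = [
    lambda c: 0.5 * c["logp"],
    lambda c: c["mw"] / 100.0,
    lambda c: 1.0,
]
compound = {"logp": 2.0, "mw": 300.0}

spread = prediction_range(compound, models)
deviation = prediction_deviation(compound, models)
```

A compound with a large `spread` is one on which the models disagree strongly, so measuring it would be expected to help decide between them.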

[0287] Structure-Property Model Discriminatory Criteria can also be used to determine if a particular compound or list of compounds exhibits selective properties. For example, Structure-Property Model Discriminatory Criteria 104 can be used to determine whether a particular compound can bind selectively to a specific target (also referred to herein as a Selectivity Criterion). For example, a Selectivity Criterion 104 can be implemented using EQ. 9: $s_{i} = \frac{p_{i}}{\sum_{i} p_{i}} \qquad \text{(EQ. 9)}$

[0288] where si denotes the selectivity of a particular compound for the i-th property (EQ. 9 assumes that the properties pi are normalized). For example, EQ. 9 can be used to describe whether a particular compound binds selectively to the enzyme Thrombin versus the enzymes Trypsin and Urokinase, by substituting pi with the binding affinities of that compound for Thrombin, Trypsin and Urokinase as predicted by a Thrombin, Trypsin and Urokinase Structure-Property Model 842, respectively. If more than one Structure-Property Model 842 is available for a particular property (or properties), EQ. 9 can be replaced by EQ. 10: $s_{i} = \frac{\operatorname{mean}_{j}\left( p_{ij} \right)}{\sum_{i} \operatorname{mean}_{j}\left( p_{ij} \right)} \qquad \text{(EQ. 10)}$

[0289] where pij is the i-th property of the compound as predicted by the j-th Structure-Property Model 842, and mean(.) is a function that returns the mean of its arguments.
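EQ. 9 and EQ. 10 can be sketched in a few lines. The example assumes that the predicted properties are already normalized to a common positive scale, as EQ. 9 requires. The numerical affinities below are invented for illustration and are not measured or predicted values.

```python
def selectivity(p):
    """EQ. 9: share of each property p[i] in the total over all properties."""
    total = sum(p)
    if total == 0:
        raise ValueError("properties must not sum to zero")
    return [pi / total for pi in p]


def selectivity_multi(p):
    """EQ. 10: as EQ. 9, with p[i][j] averaged over the models j first."""
    means = [sum(row) / len(row) for row in p]
    return selectivity(means)


# Hypothetical normalized affinities for Thrombin, Trypsin and Urokinase.
single_model = [3.0, 1.0, 1.0]
# The same three targets, each predicted by two hypothetical models.
two_models = [[2.0, 4.0], [1.0, 1.0], [1.0, 1.0]]

s_single = selectivity(single_model)
s_multi = selectivity_multi(two_models)
```

In both cases the returned values sum to one, so the first entry can be read as the fraction of the total predicted affinity attributed to the first target.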

[0290] Another such Selection Criterion 104 (referred to hereafter as a Patentability Criterion) receives as input a compound (or list of compounds) from the Compound Library 102, and returns a value indicating whether this compound is protected by an issued US or foreign patent. Preferably, the Experiment Planner 130 searches a patent database to determine whether the specified compound (or list of compounds) has been patented.

[0291] Another such Selection Criterion 104 (referred to hereafter as a Bioavailability Criterion) receives as input a compound (or list of compounds) from the Compound Library 102, and returns a value that represents the predicted bioavailability of that compound, as inferred by a suitable Bioavailability Structure-Property Model.

[0292] Another such Selection Criterion 104 (referred to hereafter as a Toxicity Criterion) receives as input a compound (or list of compounds) from the Compound Library 102, and returns a value that represents the predicted toxicity of that compound, as inferred by a suitable Toxicity Structure-Property Model.

[0293] Alternatively, the Experiment Planner 130 can define other Selection Criteria 104 that can be derived from information pertaining to a given compound or list of compounds, and that can be used to guide the selection of the Directed Diversity Library 108 for the next iteration.

[0294] ii. Second Type of Selection Criteria 104

[0295] The second type of Selection Criteria 104 represent specific constraints and/or methods for generating such lists of compounds. A few examples of such Selection Criteria 104 shall now be described.

[0296] One such Selection Criterion 104 defines a list of compounds that should not be included in the Directed Diversity Library 108 for the next iteration (referred to herein as the Excluded Compounds Criterion). For example, these compounds (referred to herein as the Excluded Compounds) can be compounds whose properties of interest are already known (e.g. compounds previously analyzed by the Analysis Module 118). Alternatively, the Excluded Compounds can be compounds whose bioavailability as predicted by a Bioavailability Structure-Property Model is below a prescribed threshold, compounds whose toxicity as predicted by a Toxicity Structure-Property Model is above a prescribed threshold, compounds that require expensive Reagents 114 to be mixed together in order to be generated by the Synthesis Module 112 (e.g. Reagents 114 whose cost exceeds a prescribed value), compounds that cannot be made in an automated or partially automated fashion by the Synthesis Module 112, etc.

[0297] The Excluded Compounds can also represent combinations of compounds that cannot all be part of a Directed Diversity Library 108 for the next iteration. For example, the Excluded Compounds can be a set of compounds that require more than one synthetic scheme to be executed by the Synthesis Module 112 in order to be synthesized. For example, if the Compound Library 102 is comprised of two or more combinatorial chemical libraries, each of which requires a different synthetic scheme to be executed by the Synthesis Module 112 in order for the compounds in these libraries to be synthesized, the Excluded Compounds Criterion can be used to exclude combinations of compounds that cannot all be made using a single synthetic scheme, or to limit the selection of compounds for the next Directed Diversity Library 108 to a specific combinatorial library (or libraries). Alternatively, the Excluded Compounds can represent combinations of compounds that require more than a prescribed number of Reagents 114 to be mixed together by the Synthesis Module 112 in order for these compounds to be synthesized. However, the present invention is not limited to these embodiments.

[0298] Another such Selection Criterion 104 defines the number and/or subset of Reagents 114 that can be mixed together by the Synthesis Module 112. Such a Selection Criterion limits the selection of the Directed Diversity Library 108 for the next iteration to a specific number and/or subset of building blocks.

[0299] Another such Selection Criterion 104 defines the way in which the Reagents 114 are to be mixed together by the Synthesis Module 112. For example, such a Selection Criterion 104 can specify that twenty Reagents 114 must be divided into two sets of ten, and these two sets of ten Reagents 114 must be mixed together in a combinatorial fashion to generate all one hundred combinations of a combinatorial library with two variable sites (referred to as an Array Design hereafter). However, the present invention is not limited to this embodiment.
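The Array Design example above (twenty reagents split into two sets of ten, combined exhaustively) can be sketched with the standard library. The reagent labels and the function name are hypothetical placeholders.

```python
from itertools import product


def array_design(reagents, split):
    """Pair every reagent in the first `split` entries with every remaining one."""
    site_a, site_b = reagents[:split], reagents[split:]
    return [(a, b) for a, b in product(site_a, site_b)]


# Twenty hypothetical reagents, R01 ... R20, divided into two sets of ten.
reagents = [f"R{i:02d}" for i in range(1, 21)]
library = array_design(reagents, 10)
```

Each tuple in `library` names the two reagents occupying the two variable sites of one product, so the list enumerates the full combinatorial array.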

[0300] b. Objective Functions 105

[0301] The Experiment Planner 130 uses one or more Selection Criteria 104 to define one or more Objective Functions 105. The Objective Function 105 represents a function and/or algorithm that receives a list of compounds from the Compound Library 102 and a list of Selection Criteria 104, and returns a numerical value that represents a collective property of the specified compounds.

[0302] Any functional form can be used to implement the Objective Function 105 and to combine the specified Selection Criteria 104. For example, a suitable Objective Function 105 is a linear combination of a prescribed set of Selection Criteria 104, as given by EQ. 11: $f(S) = \sum_{i=1}^{n} w_{i}\, c_{i}(S) \qquad \text{(EQ. 11)}$

[0303] where S is a set of compounds, ci(S) is the value of the i-th Selection Criterion 104 for the set S, wi is a weighting factor, and f(S) is the value of the Objective Function 105 for the set of compounds S. Alternatively, any other suitable functional form can be used.

[0304] An Objective Function 105 might combine, for example, a Molecular Diversity Criterion with a Molecular Similarity Criterion using EQ. 11. In this case, the weights wi determine the relative influence of the Molecular Diversity Criterion and the Molecular Similarity Criterion. For example, when the Molecular Diversity Criterion and Molecular Similarity Criterion are defined on a similar scale, EQ. 11 can be used to compute a numerical value that reflects the collective ability of a given set of compounds S to satisfy both the Molecular Diversity Criterion and Molecular Similarity Criterion under the specified weights wi. Such Objective Functions 105 that combine multiple Selection Criteria 104 are referred to hereafter as Multi-Objective Functions or Multi-Criteria Functions. Alternatively, an Objective Function 105 can include a single Selection Criterion 104. For example, an Objective Function 105 can simply return the molecular diversity of a collection of compounds, as computed by a Molecular Diversity Criterion. Examples of the use of such Objective Functions 105 and Multi-Objective Functions (not shown) to select a Directed Diversity Library 108 for the next iteration are described below.
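The linear combination of EQ. 11 can be sketched as below. For brevity, the sketch assumes each compound is reduced to a single descriptor value and uses two deliberately simple stand-in criteria (a spread and a closeness-to-target measure) in place of the Molecular Diversity and Molecular Similarity Criteria. These stand-ins, the target value and the weights are hypothetical.

```python
def objective(S, criteria, weights):
    """EQ. 11: weighted sum of the criterion values c_i(S)."""
    if len(criteria) != len(weights):
        raise ValueError("one weight is required per Selection Criterion")
    return sum(w * c(S) for w, c in zip(weights, criteria))


def spread(S):
    """Toy diversity-like criterion: range of the descriptor values."""
    return max(S) - min(S)


def closeness(S, target=2.0):
    """Toy similarity-like criterion: negated mean distance to a target value."""
    return -sum(abs(x - target) for x in S) / len(S)


S = [1.0, 2.0, 4.0]
f_value = objective(S, [spread, closeness], [0.5, 2.0])
```

Raising the second weight relative to the first makes the selection favor compounds near the target over compounds that are spread out, which is the role the text assigns to the weights wi.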

[0305] 5. The Selector 106

[0306] The Selector 106 selects a Directed Diversity Library 108 for analysis, according to the Selection Criteria 104 and any Objective Functions 105. Preferably, the Directed Diversity Library 108 is comprised of compounds that are optimal or nearly optimal with respect to the specified Selection Criteria 104 and Objective Functions 105. Moreover, the Directed Diversity Library 108 should be comprised of compounds that satisfy any constraints specified by some of these Selection Criteria 104.

[0307] The task of identifying an optimal or nearly optimal set of compounds for the next Directed Diversity Library 108, given the Selection Criteria 104 and Objective Functions 105, involves a search of all subsets of compounds from the Compound Library 102 that satisfy the constraints defined by the Experiment Planner 130. As used herein, the term ‘constraint’ denotes a Selection Criterion 104 that excludes certain compounds or certain combinations of compounds from being selected as part of the Directed Diversity Library 108 for the next iteration. Contrast constraints to other Selection Criteria 104, which specify desired properties that the selected compounds should possess, either individually or collectively. The Directed Diversity Library 108 for the next iteration should satisfy any specified constraints and should maximize the desired properties, to the extent possible.

[0308] The task of identifying an optimal or nearly optimal set of compounds for the next Directed Diversity Library 108 can be an enormous combinatorial problem. For example, when one Selection Criterion 104 limits the selection to an n-membered Compound Library 102, and another Selection Criterion 104 specifies that the Directed Diversity Library 108 for the next iteration should be comprised of k compounds from the aforementioned n-membered library, the number of different k-membered subsets of the n-membered library is given by the binomial: $N = \frac{n!}{k!\,(n-k)!} \qquad \text{(EQ. 12)}$

[0309] This task is combinatorially explosive because, in all but the simplest cases, N is far too large to allow for the construction and evaluation of every possible subset given current data processing technology. As a result, a variety of stochastic modeling techniques can be employed that are capable of providing good approximate solutions to combinatorial problems in realistic time frames. However, the present invention envisions and includes the construction and evaluation of every individual k-membered subset once computer technology advances to an appropriate point.

[0310] The Selector 106 receives the Selection Criteria 104 and Objective Functions 105 and returns the Directed Diversity Library 108. The Selector 106 preferably uses a stochastic (or exhaustive, if possible) search/optimization technique.
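To give a feel for the size of N in EQ. 12, the short sketch below evaluates the binomial with the Python standard library. The library and subset sizes are arbitrary illustrative choices.

```python
import math


def subset_count(n, k):
    """EQ. 12: number of distinct k-membered subsets of an n-membered library."""
    return math.comb(n, k)


small = subset_count(10, 3)      # small enough to enumerate exhaustively
large = subset_count(1000, 50)   # an exact, very large integer
digits = len(str(large))         # number of decimal digits in N
```

Even for a modest library of one thousand compounds and a selection of fifty, the count exceeds 10^80, which illustrates why exhaustive enumeration is set aside in favor of stochastic search.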

[0311] Referring to FIGS. 12 and 13, in one embodiment, the Selector 106 is coupled to the Compound Database 134, the Reagent Database 138 and the Structure-Property Database 126 via dedicated Servers 1204. The Selector 106 can send a proposed Compound List 1302 to the Servers 1204. The Servers 1204 can retrieve property values for the Compound List 1302 and return them to the Selector 106 as Values 1304.

[0312] Preferably, the Selector 106 generates an initial list of proposed compounds based on Selection Criteria 104 and then refines the list through an iterative process. For example, the Selector 106 can employ Monte-Carlo Sampling 834, Simulated Annealing 836, Evolutionary Programming 838, and/or a Genetic Algorithm 840, to produce a list of compounds that best satisfy all the Selection Criteria 104 in the manner specified by the Objective Function 105. The list can be refined to become the Directed Diversity Library 108 for the next iteration.

[0313] For example, referring to FIG. 13, each Server 1204 can receive a Compound List 1302 from the Selector 106. The Servers 1204 can access one or more of the databases 126, 134 and 138 to retrieve property values associated with the compounds in the Compound List 1302, and use these property values to compute the values of the respective Selection Criteria 104. The Servers 1204 can return their respective computed values as Selection Criteria Values 1304 for Compound List 1302.

[0314] Preferably, the Server 1204 can be configured by user input. For example, a user might want to select a particular method for computing molecular diversity. Similarly, a user might want to select one or more particular Structure-Property Models 192 for predicting the properties of compounds.

[0315] In one embodiment, the Selector 106 selects the Directed Diversity Library 108 for the next iteration using a Monte-Carlo Sampling 834 or Simulated Annealing 836 algorithm. Operation of this embodiment is described below with reference to FIG. 18.
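A generic simulated annealing loop of the kind the Selector 106 could employ is sketched below. It is not the procedure of FIG. 18, which is described elsewhere. It assumes a maximized Objective Function, a geometric cooling schedule, and a move that swaps one selected compound for one unselected compound. The schedule, the parameter values and the function name are hypothetical choices made for this example.

```python
import math
import random


def anneal_select(library, k, objective, steps=2000, t_start=1.0, t_end=0.01, seed=0):
    """Search for a k-membered subset of `library` with a high objective value."""
    rng = random.Random(seed)
    current = rng.sample(library, k)          # initial proposed compound list
    current_val = objective(current)
    best, best_val = list(current), current_val
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        outside = [c for c in library if c not in current]
        if not outside:
            break
        candidate = list(current)
        candidate[rng.randrange(k)] = rng.choice(outside)   # swap one member
        val = objective(candidate)
        # Metropolis rule: always accept improvements, sometimes accept losses.
        if val >= current_val or rng.random() < math.exp((val - current_val) / t):
            current, current_val = candidate, val
            if val > best_val:
                best, best_val = list(candidate), val
    return best, best_val


# Toy problem: pick 5 of 20 integer-labelled compounds with the largest label sum.
library = list(range(20))
best, best_val = anneal_select(library, 5, sum)
```

Setting the temperature very low from the start turns the loop into a greedy hill climb, while a higher temperature lets it escape local optima at the cost of more evaluations of the Objective Function.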

[0316] 6. Structure of the Present Invention

[0317] A lead generation/optimization system 100 can be implemented as a fully automated system or as a partially automated system that relies, in part, on human interaction. For example, human interaction can be employed to perform or assist in the functions described herein with respect to the Synthesis Module 112 and/or the Analysis Module 118 and/or the Directed Diversity Manager 310.

[0318] The automated portion of the lead generation/optimization system 100 can be implemented as hardware, firmware, software or any combination thereof, and can be implemented in one or more computer systems and/or other processing systems. In one embodiment, the automated portion of the invention is directed toward one or more computer systems capable of carrying out the functionality described herein.

[0319] Referring to FIG. 19, an example computer system 1901 includes one or more processors, such as processor 1904. Processor 1904 is connected to a communication bus 1902. Various software embodiments are described in terms of this example computer system 1901. After reading this description, it will become apparent to a person skilled in the relevant art how to implement the invention using other computer systems and/or computer architectures.

[0320] Computer system 1901 also includes a main memory 1906, preferably random access memory (RAM), and can also include a secondary memory 1908. Secondary memory 1908 can include, for example, a hard disk drive 1910 and/or a removable storage drive 1912, representing a floppy disk drive, a magnetic tape drive, an optical disk drive, etc. Removable storage drive 1912 reads from and/or writes to a removable storage unit 1914 in a well known manner. Removable storage unit 1914 represents a floppy disk, magnetic tape, optical disk, etc. which is read by and written to by removable storage drive 1912. Removable storage unit 1914 includes a computer usable storage medium having stored therein computer software and/or data.

[0321] In alternative embodiments, secondary memory 1908 can include other similar means for allowing computer programs or other instructions to be loaded into computer system 1901. Such means can include, for example, a removable storage unit 1922 and an interface 1920. Examples of such can include a program cartridge and cartridge interface (such as that found in video game devices), a removable memory chip (such as an EPROM, or PROM) and associated socket, and other removable storage units 1922 and interfaces 1920 which allow software and data to be transferred from the removable storage unit 1922 to computer system 1901.

[0322] Computer system 1901 can also include a communications interface 1924. Communications interface 1924 allows software and data to be transferred between computer system 1901 and external devices. Examples of communications interface 1924 include, but are not limited to, a modem, a network interface (such as an Ethernet card), a communications port, a PCMCIA slot and card, etc. Software and data transferred via communications interface 1924 are in the form of signals which can be electronic, electromagnetic, optical or other signals capable of being received by communications interface 1924. These signals 1926 are provided to communications interface 1924 via a channel 1928. This channel 1928 carries signals 1926 and can be implemented using wire or cable, fiber optics, a phone line, a cellular phone link, an RF link and other communications channels.

[0323] In this document, the terms “computer program medium” and“computer usable medium” are used to generally refer to media such asremovable storage device 1912, a hard disk installed in hard disk drive1910, and signals 1926. These computer program products are means forproviding software to computer system 1901.

[0324] Computer programs (also called computer control logic) are storedin main memory and/or secondary memory 1908. Computer programs can alsobe received via communications interface 1924. Such computer programs,when executed, enable the computer system 1901 to perform the featuresof the present invention as discussed herein. In particular, thecomputer programs, when executed, enable the processor 1904 to performthe features of the present invention. Accordingly, such computerprograms represent controllers of the computer system 1901.

[0325] In an embodiment where the invention is implemented usingsoftware, the software can be stored in a computer program product andloaded into computer system 1901 using removable storage drive 1912,hard drive 1910 or communications interface 1924. The control logic(software), when executed by the processor 1904, causes the processor1904 to perform the functions of the invention as described herein.

[0326] In another embodiment, the automated portion of the invention is implemented primarily in hardware using, for example, hardware components such as application specific integrated circuits (ASICs). Implementation of the hardware state machine so as to perform the functions described herein will be apparent to persons skilled in the relevant art(s).

[0327] In yet another embodiment, the invention is implemented using a combination of both hardware and software.

[0328] Referring to FIG. 3, a lead generation/optimization system 300 includes one or more central processing units (CPUs) 302 a, 302 b and 302 c, which can be one or more of processors 1904. CPUs 302 operate according to control logic 304, 306, and 308, which can be software, firmware, hardware or any combination thereof.

[0329] Processors 302 a, 302 b and 302 c can represent a single processor 302 or can represent multiple processors. Control logic 304, 306, and 308 can be executed on a single processor or on multiple processors 302.

[0330] Control logic 304, 306, and 308 preferably represent one or more computer programs such that the processor 302 operates according to software instructions contained in the control logic 304, 306, and 308. Alternatively, the processor 302 and/or the control logic 304, 306, and 308 are implemented as a hardware state machine.

[0331] Processor 302 a and control logic 304 collectively represent the Experiment Planner 130. Processor 302 b and control logic 306 collectively represent the Selector 106. Processor 302 c and control logic 308 collectively represent the Synthesis Protocol Generator 202. The Experiment Planner 130, the Selector 106, and the Synthesis Protocol Generator 202 collectively represent a Directed Diversity Manager 310.

[0332] Directed Diversity Manager 310 can be implemented as part of a variety of computer systems. For example, Directed Diversity Manager 310 can be implemented on an Indigo, Indy, Onyx, Challenge, Power Challenge, Octane or Origin 2000 computer made by Silicon Graphics, Inc., of Mountain View, Calif. Another suitable form for the processor 302 is a DEC Alpha Workstation computer made by Digital Equipment Corporation of Maynard, Mass. Another suitable form for the Processor 302 is one of the Pentium family of processors from Intel, such as the Pentium Pro or Pentium II. Any other suitable computer system could alternatively be used.

[0333] A Communication Medium 312, comprising one or more data buses and/or IO (input/output) interface devices, connects the Experiment Planner 130, the Selector 106, and the Synthesis Protocol Generator 202 to a number of peripheral devices, such as one or more Input Devices 316, one or more Output Devices 318, one or more Synthesis Modules 112, one or more Analysis Modules 118, and one or more Data Storage Devices 314.

[0334] The Input Device(s) 316 receive input (such as data, commands, etc.) from human operators and forward such input to the Experiment Planner 130, the Selector 106, and/or the Synthesis Protocol Generator 202 via the Communication Medium 312. Any well known, suitable input device can be used in the present invention to receive input, commands, selections, etc., from operators 317, such as a keyboard, pointing device (mouse, roller ball, track ball, light pen, etc.), touch screen, voice recognition, etc. User input can also be stored and then retrieved, as appropriate, from data/command files.

[0335] The Output Device(s) 318 output information to human operators 317. The Experiment Planner 130, the Selector 106, and/or the Synthesis Protocol Generator 202 transfer such information to the Output Device(s) 318 via the Communication Medium 312. Any well known, suitable output device can be used in the present invention, such as a monitor, a printer, a floppy disk drive, a text-to-speech synthesizer, etc.

[0336] Preferably, the Synthesis Module 112 receives Robotic Synthesis Instructions 204 (FIG. 2) from the Synthesis Protocol Generator 202 via the Communication Medium 312. The Synthesis Module 112 operates according to the Robotic Synthesis Instructions 204 to selectively combine a particular set of Reagents 114 from the Reagent Inventory 116 to thereby generate the compounds from the Directed Diversity Library 108 specified by the Selector 106 that are not retrieved from the Chemical Inventory 110.

[0337] Where Directed Diversity Manager 310 is implemented as part of a computer system, Communication Medium 312, Input Device(s) 316 and Output Device(s) 318 can be an integral part of the computer system.

[0338] The Synthesis Module 112 is preferably a robot capable of mix-and-split, solid phase chemistry for coupling chemical building blocks. As used herein, the term “robot” refers to any automated or partially automated device that automatically or semi-automatically performs functions specified by instructions such as the Robotic Synthesis Instructions 204 (FIG. 2) generated by the Synthesis Protocol Generator 202.

[0339] The Synthesis Module 112 preferably performs selective micro-scale solid state synthesis of a specific combinatorial library of Directed Diversity Library 108 compounds, but is not limited to this embodiment. The Synthesis Module 112 preferably cleaves and separates the compounds of the Directed Diversity Library 108 from support resin and distributes the compounds into preferably 96 wells with from 1 to 20 Directed Diversity Library 108 compounds per well, corresponding to an output of 96 to 1920 compounds per synthetic cycle iteration, but is not limited to this embodiment. This function can alternatively be performed by a well known liquid transfer robot (not shown). Synthesis Module(s) suitable for use with the present invention are well known and are commercially available from a number of manufacturers, such as the following:

TABLE 1

Manufacturer                                          City         State  Model
Advanced ChemTech                                     Louisville   KY     357 MPS, 390 MPS
Rainin                                                Woburn       MA     Symphony
Perkin-Elmer Corporation Applied Biosystems Division  Foster City  CA     433A
Millipore                                             Bedford      MA     9050 Plus

[0340] All of the instruments listed in Table 1 perform solid support-based peptide synthesis only. The Applied Biosystems and the Millipore instruments are single peptide synthesizers. The Rainin Symphony is a multiple peptide synthesizer capable of producing up to twenty peptides simultaneously. The Advanced ChemTech instruments are also multiple peptide synthesizers, but the 357 MPS has a feature utilizing an automated mix-and-split technology. The peptide synthesis technology is preferred in producing the Directed Diversity Libraries 108 associated with the present invention. See, for example, Gallop, M. A. et al., J. Med. Chem. 37, 1233-1250 (1994), incorporated herein by reference in its entirety.

[0341] Peptide synthesis is by no means the only approach envisioned and intended for use with the present invention. Other chemistries for generating the Directed Diversity Libraries 108 can also be used. For example, the following are suitable: peptoids (PCT Publication No. WO 91/19735, Dec. 26, 1991), encoded peptides (PCT Publication WO 93/20242, Oct. 14, 1993), random bio-oligomers (PCT Publication WO 92/00091, Jan. 9, 1992), benzodiazepines (U.S. Pat. No. 5,288,514), diversomers such as hydantoins, benzodiazepines and dipeptides (Hobbs DeWitt, S. et al., Proc. Natl. Acad. Sci. USA 90: 6909-6913 (1993)), vinylogous polypeptides (Hagihara et al., J. Amer. Chem. Soc. 114: 6568 (1992)), nonpeptidal peptidomimetics with a Beta-D-Glucose scaffolding (Hirschmann, R. et al., J. Amer. Chem. Soc. 114: 9217-9218 (1992)), analogous organic syntheses of small compound libraries (Chen, C. et al., J. Amer. Chem. Soc. 116: 2661 (1994)), oligocarbamates (Cho, C. Y. et al., Science 261: 1303 (1993)), and/or peptidyl phosphonates (Campbell, D. A. et al., J. Org. Chem. 59: 658 (1994)). See, generally, Gordon, E. M. et al., J. Med. Chem. 37: 1385 (1994). The contents of all of the aforementioned publications are incorporated herein by reference.

[0342] Alternatively, the Synthesis Module 112 can be a robot capable of solution-phase synthesis, or a workstation that enables manual synthesis of the compounds in the Directed Diversity Library 108. A number of well-known robotic systems have also been developed for solution phase chemistries. These systems include automated workstations like the automated synthesis apparatus developed by Takeda Chemical Industries, LTD. (Osaka, Japan) and many robotic systems utilizing robotic arms (Zymate II, Zymark Corporation, Hopkinton, Mass.; Orca, Hewlett-Packard, Palo Alto, Calif.) that mimic the manual synthetic operations performed by a chemist. Any of the above devices are suitable for use with the present invention. The nature and implementation of modifications to these devices (if any) so that they can operate as discussed herein will be apparent to persons skilled in the relevant art.

[0343] It is noted that the functions performed by the Synthesis Module 112 can be alternatively performed by human operators, aided or not aided by robots and/or computers.

[0344] The Analysis Module(s) 118 receives the chemical compounds synthesized by the Synthesis Module(s) 112 or retrieved from the Chemical Inventory 110. The Analysis Module(s) 118 analyzes these compounds to obtain Structure-Property Data 124 pertaining to the compounds.

[0345] FIG. 4 is a more detailed structural block diagram of an embodiment of the Analysis Module(s) 118. The Analysis Module(s) 118 include one or more Assay Modules 402, such as an Enzyme Activity Assay Module 404, a Cellular Activity Assay Module 406, a Toxicology Assay Module 408, and/or a Bioavailability Assay Module 410. The Enzyme Activity Assay Module 404 assays the compounds synthesized by the Synthesis Module(s) 112 using well known procedures to obtain enzyme activity data relating to the compounds. The Cellular Activity Assay Module 406 assays the compounds using well known procedures to obtain cellular activity data relating to the compounds. The Toxicology Assay Module 408 assays the compounds using well known procedures to obtain toxicology data relating to the compounds. The Bioavailability Assay Module 410 assays the compounds using well known procedures to obtain bioavailability data relating to the compounds.

[0346] The Enzyme Activity Assay Module 404, Cellular Activity Assay Module 406, Toxicology Assay Module 408, and Bioavailability Assay Module 410 are implemented in a well known manner to facilitate the preparation of solutions, initiation of the biological or chemical assay, termination of the assay (optional depending on the type of assay) and measurement of the results, commonly using a counting device, spectrophotometer, fluorometer or radioactivity detection device. Each of these steps can be done manually (with or without the aid of robots or computers) or by robots, in a well known manner. Raw data is collected and stored on magnetic media under computer control or input manually into a computer. Useful measurement parameters such as dissociation constants or 50% inhibition concentrations can then be manually or automatically calculated from the observed data, stored on magnetic media and output to a relational database.
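As one illustration of how a 50% inhibition concentration might be calculated automatically from observed data, the following minimal sketch interpolates linearly on a log-concentration axis between the two measurements that bracket 50% inhibition. The function name, the input layout, and the interpolation method are assumptions made for this example only; the specification does not prescribe a calculation method, and curve fitting could be used instead.

```python
import math


def ic50_log_interpolation(concentrations, percent_inhibition):
    """Estimate the 50% inhibition concentration (hypothetical helper).

    Interpolates linearly in log10(concentration) between the two
    adjacent measurements that bracket 50% inhibition. Returns None
    when the data never bracket 50%.
    """
    points = sorted(zip(concentrations, percent_inhibition))
    for (c_lo, y_lo), (c_hi, y_hi) in zip(points, points[1:]):
        if y_lo <= 50.0 <= y_hi and y_hi > y_lo:
            fraction = (50.0 - y_lo) / (y_hi - y_lo)
            log_ic50 = math.log10(c_lo) + fraction * (
                math.log10(c_hi) - math.log10(c_lo)
            )
            return 10.0 ** log_ic50
    return None  # 50% inhibition is not bracketed by the measurements
```

The returned value carries the same concentration units as the inputs, so it could be stored alongside the raw data in the relational database mentioned above.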

[0347] The Analysis Module(s) 118 optionally include a Structure and Composition Analysis Module 414 to obtain two dimensional structure and composition data relating to the compounds. Preferably, the Structure and Composition Analysis Module 414 is implemented using a liquid chromatograph device and/or a mass spectrometer. In one embodiment, a sampling robot (not shown) transfers aliquots from the 96 wells to a coupled liquid chromatography-mass spectrometry system to perform sample analysis.

[0348] The Structure and Composition Analysis Module 414 can be utilized to determine product composition and to monitor reaction progress by comparison of the experimental results to the theoretical results predicted by the Synthesis Protocol Generator 202. The Analysis Module(s) 118 can use, but is not limited to, infra-red spectroscopy, decoding of a molecular tag, mass spectrometry (MS), gas chromatography (GC), liquid chromatography (LC), or combinations of these techniques (e.g., GC-MS, LC-MS, or MS-MS). Preferably, the Structure and Composition Analysis Module 414 is implemented using a mass spectrometric technique such as Fast Atom Bombardment Mass Spectrometry (FABMS) or triple quadrupole ion spray mass spectrometry, optionally coupled to a liquid chromatograph, or matrix-assisted laser desorption ionization time-of-flight mass spectrometry (MALDI-TOF MS). MALDI-TOF MS is well known and is described in a number of references, such as: Brummell et al., Science 264:399 (1994); Zambias et al., Tetrahedron Lett. 35:4283 (1994), both incorporated herein by reference in their entireties.

[0349] Liquid chromatograph devices, gas chromatograph devices, and mass spectrometers suitable for use with the present invention are well known and are commercially available from a number of manufacturers, such as the following:

TABLE 2 GAS CHROMATOGRAPHY

Manufacturer               City       State  Model
Hewlett-Packard Company    Palo Alto  CA     5890
Varian Associates Inc.     Palo Alto  CA
Shimadzu Scientific Inst.  Columbia   MD     GC-17A
Fisons Instruments         Beverly    MA     GC 8000

[0350] TABLE 3 LIQUID CHROMATOGRAPHY

Manufacturer               City       State  Model
Hewlett-Packard Company    Palo Alto  CA     1050, 1090
Varian Associates Inc.     Palo Alto  CA
Rainin Instrument Co.      Woburn     MA
Shimadzu Scientific Inst.  Columbia   MD     LC-10A
Waters Chromatography      Milford    MA     Millenium
Perkin-Elmer Corporation   Norwalk    CT
Hitachi Instruments Inc.   San Jose   CA

[0351] TABLE 4 MASS SPECTROSCOPY

Manufacturer               City       State  Model
Hewlett-Packard Company    Palo Alto  CA
Varian Associates Inc.     Palo Alto  CA
Kratos Analytical Inc.     Ramsey     NJ     MS80RFAQ
Finnigan MAT               San Jose   CA     Vision 2000, TSQ-700
Fisons Instruments         Beverly    MA     API LC/MS, AutoSpec
Perkin-Elmer Corporation   Norwalk    CT     API-III

[0352] Modifications to these devices may be necessary to fully automate both the loading of samples on the systems as well as the comparison of the experimental and predicted results. The extent of the modification can vary from instrument to instrument. The nature and implementation of such modifications will be apparent to persons skilled in the art.

[0353] The Analysis Module(s) 118 can optionally further include a Chemical Synthesis Indicia Generator 412 that analyzes the structure and composition data obtained by the Structure and Composition Analysis Module 414 to determine which compounds were adequately synthesized by the Synthesis Module(s) 112, and which compounds were not adequately synthesized by the Synthesis Module(s) 112. In an embodiment, the Chemical Synthesis Indicia Generator 412 is implemented using a processor, such as Processor 302, operating in accordance with appropriate control logic, such as Control Logic 304, 306, and/or 308. Preferably, the Control Logic 304, 306, and/or 308 represents a computer program such that the Processor 302 operates in accordance with instructions in the Control Logic 304, 306, and/or 308 to determine which compounds were adequately synthesized by the Synthesis Module(s) 112, and which compounds were not adequately synthesized by the Synthesis Module(s) 112. Persons skilled in the relevant art will be able to produce such Control Logic 304, 306, and/or 308 based on the discussion of the Chemical Synthesis Indicia Generator 412 contained herein.

[0354] The Analysis Module(s) 118 can also include a three dimensional (3D) Receptor Mapping Module 418 to obtain three dimensional structure data relating to a receptor binding site. The 3D Receptor Mapping Module 418 preferably determines the three dimensional structure of a receptor binding site empirically through x-ray crystallography and/or nuclear magnetic resonance spectroscopy, and/or as a result of the application of extensive 3D QSAR (quantitative structure-activity relationship) and receptor field analysis procedures, well known to persons skilled in the art and described in: “Strategies for Indirect Computer-Aided Drug Design”, Gilda H. Loew et al., Pharmaceutical Research, Volume 10, No. 4, pages 475-486 (1993); “Three Dimensional Structure Activity Relationships”, G. R. Marshall et al., Trends In Pharmaceutical Science, 9: 285-289 (1988). Both of these documents are herein incorporated by reference in their entireties.

[0355] The functions performed by the Analysis Modules 118 can alternatively be performed by human operators, with or without the aid of robots and/or computers.

[0356] The Analysis Module(s) 118 can additionally include a Physical and/or Electronic Property Analysis Module(s) 416 that analyzes the compounds synthesized by the Synthesis Module(s) 112 to obtain physical and/or electronic property data relating to the compounds. Such properties can include water/octanol partition coefficients, molar refractivity, dipole moment, fluorescence, etc. Such properties can either be measured experimentally or computed using methods well known to persons skilled in the art.

[0357] Referring again to FIG. 3, the Data Storage Device 314 is a read/write high storage capacity device such as a tape drive unit or a hard disk unit. Data storage devices suitable for use with the present invention are well known and are commercially available from a number of manufacturers, such as the 2 gigabyte Differential System Disk, part number FTO-SD8-2NC, and the 10 gigabyte DLT tape drive, part number P-W-DLT, both made by Silicon Graphics, Inc., of Mountain View, Calif. The Reagent Database 138, Compound Database 134, and Structure-Property Database 126 are stored in the Data Storage Device 314.

[0358] The Reagent Database 138 contains information pertaining to the reagents in the Reagent Inventory 116. In particular, the Reagent Database 138 contains information pertaining to the chemical substructures, chemical properties, physical properties, biological properties, and electronic properties of the reagents in the Reagent Inventory 116.

[0359] The Structure-Property Database 126 stores Structure-Property Data 124, 128 (FIG. 1) pertaining to the compounds that were synthesized by the Synthesis Module(s) 112. Such Structure-Property Data 124, 128 is obtained as a result of the analysis of the compounds performed by the Analysis Module(s) 118, as described above. The Structure-Property Data 124, 128 obtained by the Analysis Module(s) 118 is transferred to and stored in the Structure-Property Database 126 via the Communication Medium 312.

[0360] FIG. 5 is a more detailed block diagram of an embodiment of the Structure-Property Database 126. The Structure-Property Database 126 includes a Structure and Composition Database 502, a Physical and Electronic Properties Database 504, a Chemical Synthesis Database 506, a Chemical Properties Database 508, a 3D Receptor Map Database 510, and a Biological Properties Database 512. The Structure and Composition Database 502 stores Structure and Composition Data 514 pertaining to compounds synthesized by the Synthesis Module(s) 112 and analyzed by the Analysis Module(s) 118. Similarly, the Physical and Electronic Properties Database 504, Chemical Synthesis Database 506, Chemical Properties Database 508, 3D Receptor Map Database 510, and Biological Properties Database 512 store Physical and Electronic Properties Data 516, Chemical Synthesis Data 518, Chemical Properties Data 520, 3D Receptor Map Data 522, and Biological Properties Data 524, respectively, pertaining to compounds retrieved from the Chemical Inventory 110 and/or synthesized by the Synthesis Module(s) 112, and analyzed by the Analysis Module(s) 118. The Structure and Composition Data 514, Physical and Electronic Properties Data 516, Chemical Synthesis Data 518, Chemical Properties Data 520, 3D Receptor Map Data 522, and Biological Properties Data 524 collectively represent the Structure-Property Data 124, 128.

[0361] In an embodiment, the Structure and Composition Database 502, Physical and Electronic Properties Database 504, Chemical Synthesis Database 506, Chemical Properties Database 508, 3D Receptor Map Database 510, and Biological Properties Database 512 each include one record for each chemical compound retrieved from the Chemical Inventory 110 and/or synthesized by the Synthesis Module(s) 112 and analyzed by the Analysis Module(s) 118 (other database structures could alternatively be used).

[0362] 7. Operation of the Present Invention

[0363] The operation of the lead generation/optimization system 100 shall now be described in detail with reference to the process flowchart 600 of FIG. 6. Steps 602-618 in process flowchart 600 represent a preferred method for identifying chemical compounds having desired properties.

[0364] The lead generation/optimization system 100 implements an iterative process where, during each iteration:

[0365] (1) a set of Selection Criteria 104 and/or one or more Objective Functions are defined (step 602);

[0366] (2) a Directed Diversity Library 108 is selected (step 604);

[0367] (3a) compounds in the Directed Diversity Library 108 are retrieved from the Chemical Inventory 110 (step 606); and/or

[0368] (3b) compounds in the Directed Diversity Library 108 that were not retrieved from the Chemical Inventory 110 are synthesized (step 608);

[0369] (4) the compounds in the Directed Diversity Library 108 are analyzed to obtain Structure-Property Data 124 pertaining to the compounds (step 612);

[0370] (5) the Structure-Property Data 124 are stored in a Structure-Property Database 126 (step 614);

[0371] (6) new Leads 122 are identified and classified (step 616); and

[0372] (7) Structure-Property Models with enhanced predictive and discriminating capabilities are constructed and/or refined to allow the selection and/or refinement of a new set of Selection Criteria 104 for the next iteration (step 618).

[0373] In an embodiment, steps 602-618 of flowchart 600 are performed during each iteration of the iterative process as indicated by control line 620 in flowchart 600.
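The iteration of steps 602-618 can be sketched as a simple control loop. This is only an illustrative outline under stated assumptions: every callable passed in (`plan`, `select`, `synthesize`, `analyze`, `is_lead`) is a hypothetical stand-in for the corresponding module, the inventory is modeled as a plain set, and step 618 is assumed to happen inside `plan`, which sees all data stored so far.

```python
def run_lead_optimization(plan, select, inventory, synthesize, analyze,
                          is_lead, n_iterations):
    """Illustrative control loop for steps 602-618 (all names hypothetical)."""
    database = []  # stands in for the Structure-Property Database 126
    leads = []     # stands in for the Leads 122
    for _ in range(n_iterations):
        # Step 602 (and, by assumption, step 618): define criteria from prior data.
        criteria = plan(database)
        # Step 604: select the Directed Diversity Library.
        library = select(criteria)
        # Step 606: retrieve what is already available.
        retrieved = [c for c in library if c in inventory]
        # Step 608: synthesize only what was not retrieved.
        made = synthesize([c for c in library if c not in inventory])
        # Step 612: analyze the compounds to obtain structure-property records.
        records = analyze(retrieved + made)
        # Step 614: store the records.
        database.extend(records)
        # Step 616: identify new leads among the new records.
        leads.extend(r for r in records if is_lead(r))
    return database, leads
```

Passing the modules in as callables keeps the sketch neutral about whether each step is automated, partially automated, or manual.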

[0374] Referring to FIG. 6, the process begins at step 602, where the Experiment Planner 130 defines Selection Criteria 104 and/or one or more Objective Functions 105. The Experiment Planner 130 defines Selection Criteria 104 and/or Objective Functions 105 based on current Structure-Property Data 124 and Historical Structure-Property Data 128. Historical Structure-Property Data 128 can be identified from previous iterations of the lead generation/optimization system 100 and/or from other independent experiments. The Experiment Planner 130 can also define Selection Criteria 104 and/or Objective Functions 105 based on one or more of: Compound Data 132; Reagent Data 136; Desired Properties 120; and Structure-Property Models 192. The Selection Criteria 104 and/or Objective Functions 105 are sent to the Selector 106. Additional details of step 602 are provided below, in the description of the next iteration of the process.

[0375] In step 604, the Selector 106 selects a Directed Diversity Library 108. The Selector 106 uses the Selection Criteria 104 and/or Objective Functions 105 that were defined by the Experiment Planner 130 in step 602. The Selector 106 can use a stochastic (or exhaustive, if possible) search/optimization technique. The search can include, but is not limited to, Monte-Carlo Sampling 834, Simulated Annealing 836, Evolutionary Programming 838, and/or a Genetic Algorithm 840, to produce a list of compounds that best satisfy all the Selection Criteria 104 in the manner specified by the Objective Function 105, and that will comprise the Directed Diversity Library 108 for the next iteration.

[0376] In one embodiment, the Selector 106 selects the Directed Diversity Library 108 for the next iteration using a Monte-Carlo Sampling 834 or Simulated Annealing 836 algorithm. In this embodiment, a collection of compounds that satisfies all the constraints specified by the Experiment Planner 130 represents a ‘state’, and is encoded in a manner that is most appropriate given those constraints. Thus, the precise encoding of a state can vary, depending on some of the Selection Criteria 104 specified by the Experiment Planner 130.

[0377] Referring to the process flowchart of FIG. 18, the process of step 604 is illustrated in greater detail for the case where a Monte-Carlo Sampling 834 or Simulated Annealing 836 algorithm is used.

[0378] In step 1804, a state, i.e., the collection of compounds that will comprise the Directed Diversity Library 108 for the next iteration, is initialized, preferably at random. Other initialization approaches could alternatively be used, such as biased or human input. The state is initialized by selecting a set of compounds and/or a set of reagents, preferably at random.

[0379] In steps 1806-1816, the state is gradually refined by a series of small stochastic ‘steps’. The term ‘step’ means a stochastic (random or partially random) modification of the state's composition, i.e. the compounds comprising the state.

[0380] In step 1806, the state is modified. Modification can include sending a randomly generated state to the Server 1204 as Compound List 1302 and receiving Values 1304 for the compounds in the Compound List 1302. The initial state can then be modified, for example, by replacing a compound currently in the state with a compound not currently in the state, or by replacing a building block of one or more compounds currently in the state. The new state can be sent to the Server 1204 as Compound List 1302 and Values 1304 can be returned for the new state.

[0381] In step 1808, the quality of the new state can be assessed using the Objective Function 105 specified by the Experiment Planner 130. The quality can be assessed by comparing the new state to the old state using the Metropolis criterion. Alternatively, any other suitable comparison criterion can be used.

[0382] In step 1810, if the new state is approved, processing proceeds to step 1812, where the Selector 106 replaces the old state with the new state. If the new state is not approved, processing proceeds to step 1814, where the Selector 106 discards the new state.

[0383] From steps 1812 and 1814, processing proceeds to step 1816, where the Selector 106 determines whether to repeat steps 1806-1814 or use the current state as the next Directed Diversity Library 108.

[0384] Steps 1806-1816 can be performed under control of a Monte-Carlo Sampling protocol 834, a Simulated Annealing protocol 836, or variants thereof, which are well known to persons skilled in the art. However, it should be understood that the system of the present invention is not limited to these embodiments.
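Steps 1806-1816 can be sketched generically as follows. This is a minimal illustration, not the disclosed implementation: the function names, the `propose` and `objective` callables, and the use of a fixed step budget as the stopping rule of step 1816 are all assumptions. The acceptance test is the standard Metropolis criterion written for maximization; passing a constant temperature list is one way to read the fixed-temperature Monte-Carlo variant, while a decreasing list corresponds to simulated annealing.

```python
import math
import random


def metropolis_accept(old_score, new_score, temperature, rng):
    """Metropolis criterion for maximization: always accept an improvement,
    accept a worse state with probability exp(delta / T)."""
    if new_score >= old_score:
        return True
    if temperature <= 0:
        return False
    return rng.random() < math.exp((new_score - old_score) / temperature)


def refine_state(state, objective, propose, temperatures, steps_per_cycle, rng):
    """Refine a state by repeated stochastic steps (hypothetical helper)."""
    score = objective(state)
    for temperature in temperatures:
        for _ in range(steps_per_cycle):
            candidate = propose(state, rng)          # step 1806: modify the state
            candidate_score = objective(candidate)   # step 1808: assess quality
            if metropolis_accept(score, candidate_score, temperature, rng):  # step 1810
                state, score = candidate, candidate_score  # step 1812: keep new state
            # otherwise the candidate is simply dropped (step 1814)
    # step 1816 is modeled here as exhausting a fixed budget of cycles and steps
    return state, score
```

Because `propose` must return a new state rather than mutate the old one, rejecting a candidate needs no explicit undo.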

[0385] For example, the Selector 106 can use Evolutionary Programming 838 or Genetic Algorithms 840, where the population of states (or chromosomes) is initialized at random and is allowed to evolve through the repeated application of genetic operators, such as crossover, mutation, and selection. The genetic operators alter the composition of the states, either individually (e.g. mutation), or by mixing elements of two or more states (e.g. crossover) in some prescribed manner. Selection is probabilistic, and is based on the relative fitness of these states as measured by the Objective Function 105. As in the case of Monte-Carlo Sampling 834 and Simulated Annealing 836 described above, the states (or chromosomes) are encoded in a manner that is most appropriate given the constraints specified by the Experiment Planner 130.
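A genetic-algorithm variant of the selection step might look like the sketch below. The encoding (each state is a fixed-size set of compound indices), the particular crossover (sampling the child from the union of two parents), the single-swap mutation, and fitness-proportional selection are all assumptions chosen for brevity; the text above leaves the operators and the encoding open.

```python
import random


def evolve(n_compounds, subset_size, fitness, generations,
           population_size, mutation_rate, rng):
    """Evolve fixed-size subsets of compound indices (hypothetical sketch)."""
    population = [
        frozenset(rng.sample(range(n_compounds), subset_size))
        for _ in range(population_size)
    ]
    best = max(population, key=fitness)
    for _ in range(generations):
        scores = [fitness(s) for s in population]
        low = min(scores)
        # Shift scores so that all selection weights are positive.
        weights = [s - low + 1e-9 for s in scores]
        next_population = []
        for _ in range(population_size):
            # Selection: probabilistic, proportional to relative fitness.
            a, b = rng.choices(population, weights=weights, k=2)
            # Crossover: mix elements of the two parent states.
            child = set(rng.sample(sorted(a | b), subset_size))
            # Mutation: swap one member for a compound outside the state.
            if rng.random() < mutation_rate:
                outside = [i for i in range(n_compounds) if i not in child]
                if outside:
                    child.remove(rng.choice(sorted(child)))
                    child.add(rng.choice(outside))
            next_population.append(frozenset(child))
        population = next_population
        champion = max(population, key=fitness)
        if fitness(champion) > fitness(best):
            best = champion
    return best
```

The function returns the best state seen in any generation, so a late generation that drifts to lower fitness does not lose an earlier good library.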

[0386] In addition to Evolutionary Programming 838 and Genetic Algorithms 840, the Selector 106 can also use any other suitable search/optimization algorithm to identify the optimal (or a nearly optimal) Directed Diversity Library 108.

[0387] Thus, the precise encoding of a state in step 604 can vary, depending on, among other things, the Selection Criteria 104 specified by the Experiment Planner 130. The implementation of these methods should be straightforward to persons skilled in the art.

[0388] Several examples are provided below to illustrate how one or more Selection Criteria 104 can be combined by one or more Objective Functions 105, and how the Selection Criteria 104 and Objective Functions 105 can be used to select a Directed Diversity Library 108 for a next iteration. These examples are provided to illustrate the present invention, not to limit it.

[0389] In the first example, the Selector 106 uses Simulated Annealing 836 to identify a set of 50 compounds from a 10,000-membered Compound Library 102 that maximize the Objective Function 105 given by EQ. 13:

ƒ(S)=D(S)  EQ. 13

[0390] using the Molecular Diversity Criterion described in EQ. 4, and a Euclidean distance measure defined in a normalized 2-dimensional property space (in the example below, the properties of these 10,000 compounds represent uniformly distributed random deviates in the unit square). In a preferred embodiment, the system encodes a state by a pair of index lists, one containing the indices of the compounds currently in the set (Included Set), and another containing the indices of the compounds not currently in the set (Excluded Set). A step (i.e. a modification of the composition of the current state) is performed by swapping one or more indices from the Included and Excluded Sets. The search was carried out in 30 temperature cycles, using 1,000 sampling steps per cycle, an exponential cooling schedule, and the Metropolis acceptance criterion.

[0391] The results of the simulation are shown in FIG. 14, where, as the simulation progresses, the selected compounds assume an optimal distribution, i.e. the diversity (spread) of these compounds is maximized. The set of compounds highlighted in FIG. 14 represent a Directed Diversity Library 108 for the next iteration, selected according to the prescribed Selection Criteria 104 and the Objective Function 105 in EQ. 13.
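The first example can be approximated with the sketch below, which uses the Included/Excluded index-list encoding, single-index swaps, an exponential cooling schedule, and Metropolis acceptance. Two things are assumptions made here: EQ. 4 is not reproduced in this section, so D(S) is replaced by a stand-in (the mean nearest-neighbour Euclidean distance within the set), and the start and end temperatures are arbitrary illustrative values. The function names are hypothetical.

```python
import math
import random


def diversity(points, included):
    """Assumed stand-in for D(S): mean nearest-neighbour distance in the set."""
    total = 0.0
    for i in included:
        total += min(math.dist(points[i], points[j]) for j in included if j != i)
    return total / len(included)


def select_diverse(points, k, cycles=30, steps=1000,
                   t_start=0.1, t_end=0.001, seed=0):
    """Simulated annealing over a pair of index lists (hypothetical sketch)."""
    rng = random.Random(seed)
    indices = list(range(len(points)))
    rng.shuffle(indices)
    included, excluded = indices[:k], indices[k:]  # random initial state
    score = diversity(points, included)
    for cycle in range(cycles):
        # Exponential cooling from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (cycle / max(cycles - 1, 1))
        for _ in range(steps):
            a = rng.randrange(len(included))
            b = rng.randrange(len(excluded))
            # Step: swap one index between the Included and Excluded Sets.
            included[a], excluded[b] = excluded[b], included[a]
            new_score = diversity(points, included)
            if new_score >= score or rng.random() < math.exp((new_score - score) / t):
                score = new_score  # Metropolis: accept the new state
            else:
                included[a], excluded[b] = excluded[b], included[a]  # undo the swap
    return included, score
```

Calling `select_diverse(points, 50)` on 10,000 random points in the unit square mirrors the parameters of the example, although in pure Python a smaller library or fewer steps is more practical for experimentation.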

[0392] In the second example, the Selector 106 uses Simulated Annealing 836 to identify a set of 50 compounds from a 10,000-membered Compound Library 102 that maximize the Objective Function 105 given by EQ. 14:

ƒ(S)=−M(S,L)  EQ. 14

[0393] using the Molecular Similarity Criterion described in EQ. 8, a set of 4 reference compounds (chosen at random), and a Euclidean distance measure defined in a normalized 2-dimensional property space. As in the previous example, the properties of these 10,000 compounds represent uniformly distributed random deviates in the unit square. The search was carried out in 30 temperature cycles, using 1,000 sampling steps per cycle, an exponential cooling schedule, and the Metropolis acceptance criterion.

[0394] The results of the simulation are shown in FIG. 15. As can be seen from FIG. 15, as the simulation progresses, the selected compounds assume an optimal distribution, i.e. the selected compounds cluster tightly around the specified reference compounds. The set of compounds highlighted in FIG. 15 represent a Directed Diversity Library 108 for the next iteration, selected according to the prescribed Selection Criteria 104 and the Objective Function 105 in EQ. 14.

[0395] In the third example, the Selector 106 uses Simulated Annealing 836 to identify a set of 50 compounds from a 10,000-membered Compound Library 102 that maximize the Objective Function 105 given by EQ. 15:

ƒ(S)=2D(S)−M(S,L)  EQ. 15

[0396] using the Molecular Diversity Criterion described in EQ. 4, the Molecular Similarity Criterion described in EQ. 8, a set of 4 reference compounds (chosen at random), and a Euclidean distance measure defined in a normalized 2-dimensional property space. As in the previous example, the properties of these 10,000 compounds represent uniformly distributed random deviates in the unit square. The search was carried out in 30 temperature cycles, using 1,000 sampling steps per cycle, an exponential cooling schedule, and the Metropolis acceptance criterion.

[0397] EQ. 15 represents a Multi-Objective Function, i.e. an Objective Function 105 that combines two, rather than one, Selection Criteria 104. The Objective Function 105 in EQ. 15 represents an Objective Function 105 that combines molecular diversity and molecular similarity. That is, the Objective Function 105 in EQ. 15 favors solutions that are both diverse and focused. The results of the simulation are shown in FIG. 16. As can be seen from FIG. 16, as the simulation progresses, the selected compounds assume an optimal distribution, i.e. the selected compounds become both diverse and focused. The set of compounds highlighted in FIG. 16 represent a Directed Diversity Library 108 for the next iteration, selected according to the prescribed Selection Criteria 104 and the Objective Function 105 in EQ. 15.
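The way EQ. 15 weights the two criteria can be illustrated numerically. Neither EQ. 4 nor EQ. 8 is reproduced in this section, so both measures below are assumed stand-ins: D(S) is taken as the mean nearest-neighbour distance within the selected set, and M(S,L) as the mean distance from each selected compound to its nearest reference compound, which is consistent with EQ. 14 maximizing −M(S,L). All function names are hypothetical.

```python
import math


def mean_nearest_neighbor_distance(points):
    """Assumed stand-in for the diversity measure D(S)."""
    total = 0.0
    for i, p in enumerate(points):
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / len(points)


def mean_distance_to_nearest_reference(points, references):
    """Assumed stand-in for M(S, L): smaller means closer to the references."""
    return sum(min(math.dist(p, r) for r in references) for p in points) / len(points)


def multi_objective(points, references, diversity_weight=2.0, similarity_weight=1.0):
    """f(S) = 2 D(S) - M(S, L) with the default weights, as in EQ. 15."""
    return (diversity_weight * mean_nearest_neighbor_distance(points)
            - similarity_weight * mean_distance_to_nearest_reference(points, references))
```

For two compounds at (0, 0) and (1, 0) with a single reference at (0, 0), the stand-ins give D = 1 and M = 0.5, so f = 2(1) − 0.5 = 1.5. Changing the weights shifts the balance between spreading the set out and keeping it near the references.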

[0398] In optional steps 606 and 608, compounds specified in the Directed Diversity Library 108 are retrieved or synthesized. Steps 606 and 608 are said to be optional because one or both of steps 606 and 608 can be performed. In one embodiment, steps 606 and 608 are both employed: when compounds specified in the Directed Diversity Library 108 were previously synthesized, they are retrieved from a chemical inventory in step 606 rather than re-synthesized; when compounds specified in the Directed Diversity Library 108 were not previously synthesized, they are synthesized in step 608. Alternatively, either of steps 606 and 608 could be employed exclusively or could be employed with other methods.

[0399] In optional step 606, the Directed Diversity Manager 310 retrieves compounds specified in the Directed Diversity Library 108 that are available in the Chemical Inventory 110. The Chemical Inventory 110 represents any source of available compounds including, but not limited to, a corporate chemical inventory, a supplier of commercially available chemical compounds, a natural product collection, etc.

[0400] In one embodiment, the Directed Diversity Manager 310 searches the Chemical Inventory 110 to identify and retrieve existing compounds of the Directed Diversity Library 108. Alternatively, a subset of the Directed Diversity Library 108, as determined by user input, for example, can be searched for and retrieved from the Chemical Inventory 110.

[0401] In optional step 608, the compounds in the Directed Diversity Library 108 that were not retrieved from the Chemical Inventory 110 in step 606 are synthesized. In one embodiment, step 608 is performed by one or more automated robotic Synthesis Modules 112 that receive Robotic Synthesis Instructions 204 from the Synthesis Protocol Generator 202.

[0402] More specifically, the Directed Diversity Manager 310 selects Reagent Data 136 from the Reagent Database 138 and generates Robotic Synthesis Instructions 204. The Reagent Data 136 identifies Reagents 114 in the Reagent Inventory 116 that are to be mixed by the one or more Synthesis Modules 112. The Robotic Synthesis Instructions 204 identify the manner in which such Reagents 114 are to be mixed. The manner of mixing can include identifying Reagents 114 to be mixed together, and specifying chemical and/or physical conditions for mixing, such as temperature, length of time, stirring, etc. The one or more Synthesis Modules 112 synthesize compounds in the Directed Diversity Library 108, using selected Reagents 114 from the Reagent Inventory 116, in accordance with the Robotic Synthesis Instructions 204.

[0403] In another embodiment, optional step 608 is performed semi-automatically or manually. The chemical compounds that were retrieved from the Chemical Inventory 110 and/or synthesized by the Synthesis Modules 112 (or synthesized manually) collectively represent physical compounds from a Directed Diversity Library 108.
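The division of work between steps 606 and 608 can be sketched as a simple partition of the Directed Diversity Library. The function name plan_acquisition and the use of string compound identifiers are hypothetical and serve only to illustrate the embodiment in which both steps are employed.

```python
def plan_acquisition(library_ids, inventory_ids):
    """Split a Directed Diversity Library into compounds to retrieve
    (step 606) and compounds to synthesize (step 608).

    Assumes each compound is represented by a hashable identifier.
    """
    inventory = set(inventory_ids)
    retrieve = [cid for cid in library_ids if cid in inventory]
    synthesize = [cid for cid in library_ids if cid not in inventory]
    return retrieve, synthesize


if __name__ == "__main__":
    to_retrieve, to_synthesize = plan_acquisition(
        ["C1", "C2", "C3", "C4"], ["C2", "C4", "C9"]
    )
    print(to_retrieve)    # compounds already held in the inventory
    print(to_synthesize)  # compounds that still have to be made
```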

[0404] In step 612, one or more Analysis Modules 118 analyze the compounds in the Directed Diversity Library 108 to obtain Structure-Property Data 124 pertaining to the compounds. The Analysis Modules 118 receive compounds that were retrieved from the Chemical Inventory 110 in step 606 and compounds that were synthesized by the Synthesis Modules 112 in step 608.

[0405] In one embodiment of step 612, one or more Assay Modules 402 can robotically assay the chemical compounds in the Directed Diversity Library 108 to obtain Physical Properties Data 516, Chemical Properties Data 520 and Biological Properties Data 524, pertaining to the chemical compounds.

[0406] For example, the Enzyme Activity Assay Module 404 can robotically assay the chemical compounds using well-known assay techniques to obtain enzyme activity data relating to the compounds. Enzyme activity data can include inhibition constants Ki, maximal velocity Vmax, etc. The Cellular Activity Assay Module 406 can robotically assay the compounds using well-known assay techniques to obtain cellular activity data relating to the compounds. The Toxicology Assay Module 408 can robotically assay the compounds using well-known assay techniques to obtain toxicology data relating to the compounds. The Bioavailability Assay Module 410 can robotically assay the compounds using well-known assay techniques to obtain bioavailability data relating to the compounds. The enzyme activity data, cellular activity data, toxicology data, and bioavailability data represent the Physical Properties Data 516, Chemical Properties Data 520 and Biological Properties Data 524. Alternatively, Physical Properties Data 516 can be obtained by the Physical and Electronic Properties Analysis Module 416.

[0407] Also during step 612, the Physical and Electronic Properties Analysis Module 416 can analyze the chemical compounds contained in the Directed Diversity Library 108 to obtain Electronic Properties Data 516 pertaining to the chemical compounds. The Electronic Properties Data 516 is stored in the Physical and Electronic Properties Database 504 during step 614.

[0408] Also during step 612, the 3D Receptor Mapping Module 418 can obtain 3D Receptor Map Data 522 representing the three-dimensional structure pertaining to a receptor binding site being tested. The 3D Receptor Mapping Module 418 preferably determines the three-dimensional structure of the receptor binding site empirically through X-ray crystallography, nuclear magnetic resonance spectroscopy, and/or as a result of the application of 3D QSAR and receptor field analysis procedures. The Receptor Map Data 522 is stored in the Receptor Map Database 510 during step 614.

[0409] Also during step 612, an optional Structure and Composition Analysis Module 414 can analyze the chemical compounds contained in the Directed Diversity Library 108 to obtain Structure and Composition Data 514 pertaining to the chemical compounds. The Structure and Composition Data 514 is stored in the Structure and Composition Database 502 during step 614.

[0410] In one embodiment, step 612 is performed robotically, under control of one or more computer programs. Alternatively, step 612 can be performed manually or by some combination of the two.

[0411] In step 614, the one or more Analysis Modules 118 store the Structure-Property Data 124 obtained in step 612. The Structure-Property Data 124 can be stored in the Structure-Property Database 126 of the Data Storage Device 314. The Structure-Property Database 126 can also store Historical Structure-Property Data 128. Historical Structure-Property Data 128 can be associated with chemical compounds that were synthesized and analyzed in previous iterations by the Synthesis Modules 112 and the Analysis Modules 118, respectively. Historical Structure-Property Data 128 can also include other pertinent Structure-Property Data obtained from independent experiments.

[0412] Using the example from step 612, the Physical Properties Data 516 can be stored in the Physical and Electronic Properties Database 504, the Chemical Properties Data 520 can be stored in the Chemical Properties Database 508 and the Biological Properties Data 524 can be stored in the Biological Properties Database 512.

[0413] In one embodiment of the present invention, during execution of steps 612 and 614, a determination is made as to whether a chemical compound was adequately synthesized. The determination is made by the Analysis Modules 118, as shall now be described.

[0414] Referring to FIG. 7, the process begins at step 702, where the Structure and Composition Analysis Module 414 analyzes chemical compounds to obtain Structure and Composition Data 514. Preferably, the Structure and Composition Analysis Module 414 analyzes the chemical compounds using well-known mass spectra analysis techniques.

[0415] In step 704, the Structure and Composition Data 514 is stored in a Structure and Composition Database 502 that forms part of the Structure-Property Database 126.

[0416] In step 706, the Chemical Synthesis Indicia Generator 412 retrieves predicted Structure and Composition Data 514 relating to the compounds. The data is retrieved from the Structure-Property Database 126. Preferably, the retrieved data includes predicted mass and structural data for the compounds.

[0417] In step 708, the Chemical Synthesis Indicia Generator 412 compares the measured Structure and Composition Data 514 to the predicted data to generate Chemical Synthesis Indicia 518. Based on the comparisons, the Chemical Synthesis Indicia 518 identifies chemical compounds that were adequately synthesized and chemical compounds that were not adequately synthesized.

[0418] Preferably, during step 708, the Chemical Synthesis Indicia Generator 412 compares the measured mass of each compound to the predicted mass of the compound. If the measured mass and the predicted mass differ by less than a predetermined amount, the Chemical Synthesis Indicia Generator 412 determines that the chemical compound was adequately synthesized. If the measured mass and the predicted mass differ by more than the predetermined amount, the Chemical Synthesis Indicia Generator 412 determines that the chemical compound was not adequately synthesized. This predetermined amount can depend on the sensitivity of the instrument used for the structure and composition analysis.

[0419] In step 710, the Chemical Synthesis Indicia Generator 412 generates Chemical Synthesis Indicia 518 pertaining to the compounds in the Directed Diversity Library 108, and stores such Chemical Synthesis Indicia 518 in the Chemical Synthesis Database 506. The Chemical Synthesis Indicia 518 for each compound is a first value (such as “1”) if the compound was adequately synthesized (as determined in step 708), and is a second value (such as “0”) if the compound was not adequately synthesized.
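The mass comparison of step 708 and the indicia of step 710 can be sketched as follows. The function name synthesis_indicia, the dictionary layout and the numeric values are hypothetical. The text does not specify the result when the difference exactly equals the predetermined amount or when no measurement exists; this sketch assumes both cases are treated as not adequately synthesized.

```python
def synthesis_indicia(measured_mass, predicted_mass, tolerance):
    """Return 1 for compounds judged adequately synthesized, else 0.

    measured_mass and predicted_mass map compound identifiers to masses;
    tolerance is the predetermined amount, which can depend on the
    sensitivity of the instrument.
    """
    indicia = {}
    for cid, predicted in predicted_mass.items():
        measured = measured_mass.get(cid)
        # Assumption: a missing measurement, or a difference equal to the
        # tolerance, counts as not adequately synthesized.
        ok = measured is not None and abs(measured - predicted) < tolerance
        indicia[cid] = 1 if ok else 0
    return indicia


if __name__ == "__main__":
    measured = {"C1": 300.12, "C2": 415.90}
    predicted = {"C1": 300.10, "C2": 412.20, "C3": 250.00}
    print(synthesis_indicia(measured, predicted, tolerance=0.5))
```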

[0420] After step 710, control passes to step 616.

[0421] In step 616, the Directed Diversity Manager 310 compares the Structure-Property Data 124, pertaining to the compounds in the Directed Diversity Library 108, to the Desired Properties 120. The Desired Properties 120 might have been entered by human operators using the input device 316, or read from a computer file. The Directed Diversity Manager 310 compares the data to determine whether any of the compounds substantially conforms to the Desired Properties 120. When a compound substantially conforms to the Desired Properties 120, it can be classified as a Lead Compound 122.

[0422] When an insufficient number of compounds substantially exhibit the Desired Properties 120 (i.e., an insufficient number of Lead Compounds 122), the compounds can be rated in order to select new Leads 122. The Directed Diversity Manager 310 can assign one or more rating factors to each compound in the Directed Diversity Library 108, based on how closely the compound's properties match the Desired Properties 120. The one or more rating factors can be represented by numerical or linguistic values. Numerical rating factors represent a sliding scale between a low value, corresponding to a property profile far from the Desired Properties 120, and a high value, corresponding to a property profile identical, or very similar, to the Desired Properties 120. Linguistic rating factors can include values such as “poor,” “average,” “good,” “very good,” etc.
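One possible form of the rating factors is sketched below. The text does not prescribe a formula, so the distance-based score, the scale parameter and the thresholds that map a numerical rating to a linguistic value are all assumptions chosen for illustration.

```python
import math


def numeric_rating(observed, desired, scale=1.0):
    """Map a property profile to a rating in (0, 1].

    Assumption: the rating decreases with the Euclidean distance between
    the observed profile and the desired profile; 1.0 means identical.
    """
    distance = math.sqrt(
        sum((observed[name] - target) ** 2 for name, target in desired.items())
    )
    return 1.0 / (1.0 + distance / scale)


def linguistic_rating(score):
    """Translate a numerical rating into a linguistic value.

    The thresholds are hypothetical.
    """
    if score >= 0.9:
        return "very good"
    if score >= 0.7:
        return "good"
    if score >= 0.4:
        return "average"
    return "poor"


if __name__ == "__main__":
    desired = {"potency": 3.0, "solubility": 4.0}
    for profile in ({"potency": 3.0, "solubility": 4.0},
                    {"potency": 0.0, "solubility": 0.0}):
        score = numeric_rating(profile, desired)
        print(round(score, 3), linguistic_rating(score))
```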

[0423] In optional step 618, one or more Structure-Property Models 192 are generated and/or refined. Structure-Property Models 192 are generated and/or refined to conform to observed Structure-Property Data 124 and Historical Structure-Property Data 128. The resulting Structure-Property Models 192 can be used by the Experiment Planner 130 and/or the Selector 106 to predict the properties of compounds in the Compound Library 102 whose real properties are hitherto unknown. The Structure-Property Models can be used by the Experiment Planner 130 to define and/or refine a set of Selection Criteria 104 that depend upon the predictions of the Structure-Property Models.

[0424] Referring to the process flowchart of FIG. 17, step 618 shall now be described in detail. The process begins at step 1702 where one or more Model Structures 820 are defined by Structure-Property Model Generator 800. The Structure-Property Model Generator 800 can define Model Structures 820 based on Statistics 802, Neural Networks 804, Fuzzy Logic 806, and/or other Model-Specific Methods 808. The Model Structure 820 can combine elements of Statistics 802, Neural Networks 804, Fuzzy Logic 806, and/or Model-Specific Methods 808. Such Model Structures 820 are hereafter referred to as Hybrid Model Structures or Hybrid Models.

[0425] In step 1704, Structure-Property Model Generator 800 receives Structure-Property Data 124 and 128. Structure-Property Data 124 and 128 is separated into Structure Data 824 and Property Data 828.

[0426] In step 1706, Structure Data 824 is encoded as Encoded Structure Data 826. Structure Data 824 is encoded in a form that is appropriate for the particular Model Structure 820.

[0427] In step 1708, Property Data 828 is encoded as Encoded Property Data 830. Property Data 828 is encoded in a form that is appropriate for the particular Model Structure 820.

[0428] In step 1710, the Trainer 822 optimizes, or trains, the Model Structure 820 that was generated in step 1702. Trainer 822 uses Encoded Structure Data 826 and Encoded Property Data 830 to derive one or more Structure-Property Models 842. Trainer 822 uses one or more of Gradient Minimization 832, Monte-Carlo Sampling 834, Simulated Annealing 836, Evolutionary Programming 838, and/or a Genetic Algorithm 840, depending upon the type of Model Structure 820 that is being optimized.
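Steps 1704 through 1710 can be illustrated with a minimal sketch that uses gradient minimization, one of the training techniques listed above. A linear model is assumed as the Model Structure purely for brevity; the text does not limit the Model Structure to this form. The encoding of each structure as a numeric descriptor vector, the function names, the learning rate and the epoch count are hypothetical.

```python
def train_linear_model(encoded_structures, encoded_properties,
                       rate=0.1, epochs=2000):
    """Fit property ~ bias + sum(w_j * x_j) by batch gradient descent
    on the mean squared error between predicted and actual properties."""
    n = len(encoded_structures)
    n_features = len(encoded_structures[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(encoded_structures, encoded_properties):
            error = bias + sum(w * xi for w, xi in zip(weights, x)) - y
            for j, xi in enumerate(x):
                grad_w[j] += 2.0 * error * xi / n
            grad_b += 2.0 * error / n
        # Step against the gradient to reduce the error.
        weights = [w - rate * g for w, g in zip(weights, grad_w)]
        bias -= rate * grad_b
    return weights, bias


def predict(model, x):
    """Predict the encoded property of one encoded structure."""
    weights, bias = model
    return bias + sum(w * xi for w, xi in zip(weights, x))


if __name__ == "__main__":
    # Hypothetical encoded data: two descriptors per compound.
    structures = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
    properties = [2.0 * a - b + 0.5 for a, b in structures]
    model = train_linear_model(structures, properties)
    print(predict(model, (0.25, 0.75)))
```

A trained model of this kind could then be used to predict properties of compounds whose real properties are unknown, which is the role the text assigns to the Structure-Property Models.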

[0429] After step 1710, step 618 is complete and control passes back to step 602 for defining another set of Selection Criteria 104 and/or Objective Functions 105 and then to step 604 for selecting another Directed Diversity Library 108 to analyze. The Directed Diversity Library 108 for the next iteration can be selected using one or more Selection Criteria 104, one or more Objective Functions 105, and one or more selection phases. As used herein, a selection phase refers to a single run of the Selector 106 using Monte-Carlo Sampling 834, Simulated Annealing 836, Evolutionary Programming 838, and/or a Genetic Algorithm 840.

[0430] 8. Conclusions

[0431] The present invention has been described above with the aid of functional building blocks illustrating the performance of specified functions and relationships thereof. The boundaries of these functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternate boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Any such alternate boundaries are thus within the scope and spirit of the claimed invention. One skilled in the art will recognize that these functional building blocks can be implemented by discrete components, application specific integrated circuits, processors executing appropriate software and the like or any combination thereof.

[0432] While various embodiments of the present invention have been described above, it should be understood that they have been presented by way of example only, and not limitation. Thus, the breadth and scope of the present invention should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.

What is claimed is:
 1. A method for identifying chemical compounds having desired properties, comprising the steps of: (1) generating a first set of selection criteria based on one or more desired properties; (2) selecting a first subset of compounds from a library of compounds based on the first set of selection criteria; (3) analyzing the first subset of compounds; and (4) determining, responsive to said analysis of step (3), whether any of the compounds in the first subset of compounds has one or more properties that are substantially similar to the one or more desired properties.
 2. The method according to claim 1, further comprising the steps of: (5) generating a second set of selection criteria based on the one or more desired properties and based on one or more properties of one or more of the compounds in the first subset of compounds; and (6) selecting a second subset of compounds from the library of compounds based on the second set of selection criteria; (7) analyzing the second subset of compounds; and (8) determining, responsive to said analysis of step (7), whether any of the compounds in the second subset of compounds has one or more properties that are substantially similar to the one or more desired properties.
 3. The method according to claim 1, wherein step (1) comprises the steps of: (a) generating one or more structure-property models that predict properties of compounds; and (b) training the one or more structure-property models to minimize error between predicted properties and actual properties.
 4. The method according to claim 3, wherein step (1)(a) comprises the step of: (i) generating at least one neural network structure-property model.
 5. The method according to claim 3, wherein step (1)(a) comprises the step of: (i) generating at least one Neuro-Fuzzy structure-property model based on neural networks and fuzzy logic.
 6. The method according to claim 3, wherein step (1)(a) comprises the step of: (i) generating at least one generalized regression neural network structure-property model that employs K-nearest-neighbor classifiers.
 7. The method according to claim 3, wherein step (1)(b) comprises training the one or more structure-property models using one or more of the following techniques: (i) gradient minimization; (ii) Monte Carlo; (iii) simulated annealing; (iv) evolutionary programming; and (v) genetic algorithms.
 8. The method according to claim 1, wherein step (1) comprises the step of: (a) generating one or more objective functions from the first set of selection criteria, each objective function specifying a collection of selection criteria that a selected compound should exhibit.
 9. The method according to claim 1, wherein step (2) comprises the steps of: (a) selecting an initial set of one or more compounds; (b) assessing the initial set of one or more compounds; (c) modifying the initial set of one or more compounds to generate a new set of one or more compounds; (d) assessing the new set of one or more compounds; (e) replacing the initial set of one or more compounds with the new set of one or more compounds when the new set of one or more compounds is determined to be better than the initial set of one or more compounds; and (f) repeating steps (1)(a)-(1)(e) a number of times; and (g) outputting a set of compounds as the first subset of compounds.
 10. The method according to claim 1, wherein step (2) comprises selecting a first subset of compounds using one or more of the following techniques: (a) Monte Carlo; (b) simulated annealing; (c) evolutionary programming; and (d) genetic algorithms.